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Preface 


This book follows Bayesian Statistical Modelling (Wiley, 2001) in seeking to make the 
Bayesian approach to data analysis and modelling accessible to a wide range of 
researchers, students and others involved in applied statistical analysis. Bayesian statis- 
tical analysis as implemented by sampling based estimation methods has facilitated the 
analysis of complex multi-faceted problems which are often difficult to tackle using 
‘classical’ likelihood based methods. 

The preferred tool in this book, as in Bayesian Statistical Modelling, is the package 
WINBUGS; this package enables a simplified and flexible approach to modelling in 
which specification of the full conditional densities is not necessary and so small changes 
in program code can achieve a wide variation in modelling options (so, inter alia, 
facilitating sensitivity analysis to likelihood and prior assumptions). As Meyer and Yu 
in the Econometrics Journal (2000, pp. 198-215) state, “any modifications of a model 
including changes of priors and sampling error distributions are readily realised with 
only minor changes of the code.” Other sophisticated Bayesian software for MCMC 
modelling has been developed in packages such as S-Plus, Minitab and Matlab, but is 
likely to require major reprogramming to reflect changes in model assumptions; so my 
own preference remains WINBUGS, despite its possible slower performance and con- 
vergence than tailored made programs. 

There is greater emphasis in the current book on detailed modelling questions such as 
model checking and model choice, and the specification of the defining components (in 
terms of priors and likelihoods) of model variants. While much analytical thought has 
been put into how to choose between two models, say M, and M>, the process 
underlying the specification of the components of each model is subject, especially in 
more complex problems, to a range of choices. Despite an intention to highlight these 
questions of model specification and discrimination, there remains considerable scope 
for the reader to assess sensitivity to alternative priors, and other model components. 
My intention is not to provide fully self-contained analyses with no issues still to resolve. 

The reader will notice many of the usual ‘specimen’ data sets (the Scottish lip cancer 
and the ship damage data come to mind), as well as some more unfamiliar and larger 
data sets. Despite recent advantages in computing power and speed which allow 
estimation via repeated sampling to become a serious option, a full MCMC analysis 
of a large data set, with parallel chains to ensure sample space coverage and enable 
convergence to be monitored, is still a time-consuming affair. 

Some fairly standard divisions between topics (e.g. time series vs panel data analysis) 
have been followed, but there is also an interdisciplinary emphasis which means that 
structural equation techniques (traditionally the domain of psychometrics and educa- 
tional statistics) receive a chapter, as do the techniques of epidemiology. I seek to review 
the main modelling questions and cover recent developments without necessarily going 
into the full range of questions in specifying conditional densities or MCMC sampling 
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options (one of the benefits of WINBUGS means that this is a possible strategy). 
I recognise the ambitiousness of such a broad treatment, which the more cautious 
might not attempt. I am pleased to receive comments (nice and possibly not so nice) 
on the success of this venture, as well as any detailed questions about programs or 
results via e-mail at p.congdon@qmul.ac.uk. The WINBUGS programs that support 
the examples in the book are made available at ftp://ftp.wiley.co.uk/pub/books/congdon. 


Peter Congdon 
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Copyright © 2003 John Wiley & Sons, Ltd. 
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CHAPTER 1 


The Basis for, and Advantages 
of, Bayesian Model Estimation 
via Repeated Sampling 


1.1 INTRODUCTION 


Bayesian analysis of data in the health, social and physical sciences has been greatly 
facilitated in the last decade by advances in computing power and improved scope for 
estimation via iterative sampling methods. Yet the Bayesian perspective, which stresses 
the accumulation of knowledge about parameters in a synthesis of prior knowledge with 
the data at hand, has a longer history. Bayesian methods in econometrics, including 
applications to linear regression, serial correlation in time series, and simultaneous 
equations, have been developed since the 1960s with the seminal work of Box and 
Tiao (1973) and Zellner (1971). Early Bayesian applications in physics are exemplified 
by the work of Jaynes (e.g. Jaynes, 1976) and are discussed, along with recent applica- 
tions, by D’Agostini (1999). Rao (1975) in the context of smoothing exchangeable 
parameters and Berry (1980) in relation to clinical trials exemplify Bayes reasoning in 
biostatistics and biometrics, and it is here that many recent advances have occurred. 

Among the benefits of the Bayesian approach and of recent sampling methods of 
Bayesian estimation (Gelfand and Smith, 1990) are a more natural interpretation of 
parameter intervals, whether called credible or confidence intervals, and the ease with 
which the true parameter density (possibly skew or even multi-modal) may be obtained. 
By contrast, maximum likelihood estimates rely on Normality approximations based on 
large sample asymptotics. The flexibility of Bayesian sampling estimation extends to 
derived or ‘structural’ parameters’ combining model parameters and possibly data, and 
with substantive meaning in application areas (Jackman, 2000), which under classical 
methods might require the delta technique. 

New estimation methods also assist in the application of Bayesian random effects 
models for pooling strength across sets of related units; these have played a major role in 
applications such as analysing spatial disease patterns, small domain estimation for 
survey outcomes (Ghosh and Rao, 1994), and meta-analysis across several studies 
(Smith et al., 1995). Unlike classical techniques, the Bayesian method allows model 
comparison across non-nested alternatives, and again the recent sampling estimation 


' See, for instance, Example 2.8 on geriatric patient length of stay. 
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developments have facilitated new methods of model choice (e.g. Gelfand and Ghosh, 
1998; Chib, 1995). The MCMC methodology may be used to augment the data and this 
provides an analogue to the classical EM method — examples of such data augmentation 
are latent continuous data underlying binary outcomes (Albert and Chib, 1993) and the 
multinomial group membership indicators (equalling 1 if subject 7 belongs to group j) 
that underlie parametric mixtures. In fact, a sampling-based analysis may be made 
easier by introducing this extra data — an example is the item analysis model involving 
‘guessing parameters’ (Sahu, 2001). 


1.1.1 Priors for parameters 


In classical inference the sample data y are taken as random while population param- 
eters 0, of dimension p, are taken as fixed. In Bayesian analysis, parameters themselves 
follow a probability distribution, knowledge about which (before considering the data 
at hand) is summarised in a prior distribution 7(8). In many situations, it might be 
beneficial to include in this prior density the available cumulative evidence about a 
parameter from previous scientific studies (e.g. an odds ratio relating the effect of 
smoking over five cigarettes daily through pregnancy on infant birthweight below 
2500 g). This might be obtained by a formal or informal meta-analysis of existing 
studies. A range of other methods exist to determine or elicit subjective priors (Berger, 
1985, Chapter 3; O’Hagan, 1994, Chapter 6). For example, the histogram method 
divides the range of 0 into a set of intervals (or “bins’) and uses the subjective probability 
of 0 lying in each interval; from this set of probabilities, 7(8) may then be represented as 
a discrete prior or converted to a smooth density. Another technique uses prior esti- 
mates of moments, for instance in a Normal N(m, V) density” with prior estimates m 
and V of the mean and variance. 

Often, a prior amounts to a form of modelling assumption or hypothesis about the 
nature of parameters, for example, in random effects models. Thus, small area death 
rate models may include spatially correlated random effects, exchangeable random 
effects with no spatial pattern, or both. A prior specifying the errors as spatially 
correlated is likely to be a working model assumption, rather than a true cumulation 
of knowledge. 

In many situations, existing knowledge may be difficult to summarise or elicit in the 
form of an ‘informative prior’ and to reflect such essentially prior ignorance, resort is 
made to non-informative priors. Examples are flat priors (e.g. that a parameter is 
uniformly distributed between —oo and oo) and Jeffreys prior 


a7(0) x det{7(@)}°° 


where Z(0) is the expected information? matrix. It is possible that a prior is improper 
(doesn’t integrate to 1 over its range). Such priors may add to identifiability problems 
(Gelfand and Sahu, 1999), and so many studies prefer to adopt minimally informative 
priors which are ‘just proper’. This strategy is considered below in terms of possible 
prior densities to adopt for the variance or its inverse. An example for a parameter 


? In fact, when 9 is univariate over the entire real line then the Normal density is the maximum entropy prior 
according to Jaynes (1968); the Normal density has maximum entropy among the class of densities identified 
by a summary consisting of mean and variance. 


37 4(0) \ 


3 Tf (8) = log (L(6)) then 7(0) = ~E SREB) 
((0;)5L(8; 
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distributed over all real values might be a Normal with mean zero and large variance. 
To adequately reflect prior ignorance while avoiding impropriety, Spiegelhalter et al. 
(1996) suggesting a prior standard deviation at least an order of magnitude greater than 
the posterior standard deviation. 


1.1.2 Posterior density vs. likelihood 


In classical approaches such as maximum likelihood, inference is based on the 
likelihood of the data alone. In Bayesian models, the likelihood of the observed data 
y given parameters 6, denoted f(y|6) or equivalently L(6|y), is used to modify the 
prior beliefs 7(8), with the updated knowledge summarised in a posterior density, 
a(6|y). The relationship between these densities follows from standard probability 
equations. Thus 


f(y, 9) = fOO) = 7(6|y)m(y) 


and therefore the posterior density can be written 


TOY) = f(y|0)m(0)/m(y) 


The denominator m( y) is known as the marginal likelihood of the data and found by 
integrating (or ‘marginalising’) the likelihood over the prior densities 


m(y) = [romao 


This quantity plays a central role in some approaches to Bayesian model choice, but for 
the present purpose can be seen as a proportionality factor, so that 


Toly) x fC y]8)7(8) (1.1) 


Thus, updated beliefs are a function of prior knowledge and the sample data evidence. 
From the Bayesian perspective the likelihood is viewed as a function of 0 given fixed 
data y, and so elements in the likelihood that are not functions of 0 become part of the 
proportionality in Equation (1.1). 


1.1.3 Predictions 


The principle of updating extends to future values or predictions of ‘new data’. 
Before the study a prediction would be based on random draws from the prior 
density of parameters and is likely to have little precision. Part of the goal of the a 
new study is to use the data as a basis for making improved predictions ‘out of 
sample’. Thus, in a meta-analysis of mortality odds ratios (for a new as against 
conventional therapy) it may be useful to assess the likely odds ratio z in a 
hypothetical future study on the basis of the observed study findings. Such a 
prediction is based is based on the likelihood of z averaged over the posterior density 
based on y: 


feb) = [Felony 


where the likelihood of z, namely f (z|0) usually takes the same form as adopted for the 
observations themselves. 
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One may also take predictive samples order to assess the model performance. A 
particular instance of this, useful in model assessment (see Chapters 2 and 3), is in 
cross-validation based on omitting a single case. Data for case i is observed, but a 
prediction of y; is nevertheless made on the basis of the remaining data 
Yig = (V1. Y2» - -Yi-1; Yi 15+. Vn}. Thus in a regression example with covariates x;, the 
prediction z; would be made based on a model fitted to yjq; a typical example might 
be a time series model for t = 1,..n, including covariates that are functions of time, 
where the model is fitted only up to i=n—1 (the likelihood is defined only for 
i=1,..n— 1), and the prediction for i= n is based on the updated time functions. 
The success of a model is then based on the match between the replicate and actual data. 

One may also derive 


fOr) = [Looma 


namely the probability of y; given a model based on the data excluding it (Gelfand et al., 
1992). This is known as the Conditional Predictive Ordinate (CPO) and has a role in 
model diagnostics (see Section 1.5). For example, a set of count data (without covari- 
ates) could be modelled as Poisson (with case i excluded) leading to a mean 6j. The 
Poisson probability of case i could then be evaluated in terms of that parameter. 

This type of approach (n-fold cross-validation) may be computationally expensive 
except in small samples. Another option is for a large dataset to be randomly divided 
into a small number k of groups; then cross-validation may be applied to each partition 
of the data, with k — 1 groups as ‘training’ sample and the remaining group as the 
validation sample (Alqalaff and Gustafson, 2001). For large datasets, one might take 
50% of the data as the training sample and the remainder as the validation sample (i.e. 
k = 2). 

One may also sample new or replicate data based on a model fitted to all observed 
cases. For instance, in a regression application with predictors x; for case i, a prediction 
z; would make use of the estimated regression parameters B and the predictors as they 
are incorporated in the regression means, for example p; = x;8 for a linear regression 
These predictions may be used in model choice criteria such as those of Gelfand and 
Ghosh (1998) and the expected predictive deviance of Carlin and Louis (1996). 


1.1.4 Sampling parameters 


To update knowledge about the parameters requires that one can sample from the 
posterior density. From the viewpoint of sampling from the density of a particular 
parameter 0x, it follows from Equation (1.1) that aspects of the likelihood which are not 
functions of 0 may be omitted. Thus, consider a binomial example with r successes from 
n trials, and with unknown parameter p representing the binomial probability, with a 
beta prior B(a, b), where the beta density is 


F(a b) 

Trab) 

The likelihood is then, viewed as a function of p, proportional to a beta density, namely 
So) xpa- p” 


and the posterior density for p is then a beta density with parameters r a and 
n b-r: 


pa! a py 
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Therefore, the parameter’s posterior density may be obtained by sampling from the 
relevant beta density, as discussed below. Incidentally, this example shows how the prior 
may in effect be seen to provide a prior sample, here of sizea b — 2, the size of which 
increases with the confidence attached to the prior belief. For instance, if a = b = 2, 
then the prior is equivalent to a prior sample of 1 success and 1 failure. 

In Equation (1.2), a simple analytic result provides a method for sampling of 
the unknown parameter. This is an example where the prior and the likelihood 
are conjugate since both the prior and posterior density are of the same type. In 
more general situations, with many parameters in @ and with possibly non- 
conjugate priors, the goal is to summarise the marginal posterior of a particular 
parameter 6; given the data. This involves integrating out all the parameters but 
this one 


Pob) = | PO, sacks Ogas Ok 1,--9p|y)d01 . . . dOk-ıdOk 1...d8, 


Such integrations in the past involved demanding methods such as numerical quadra- 
ture. 

Monte Carlo Markov Chain (MCMC) methods, by contrast, use various techniques 
which ultimately amount to simulating repeatedly from the joint posterior of all the 
parameters 


P(01, 82, ... Oly) 


without undertaking such integrations. However, inferences about the form of the 
parameter densities are complicated by the fact that the samples are correlated. Suppose 
S samples are taken from the joint posterior via MCMC sampling, then marginal 
posteriors for, say, 9, may be estimated by averaging over the S samples 
0x1, 942---- Oks. For example, the mean of the posterior density may be taken as the 
average of the samples, and the quantiles of the posterior density are given by the 
relevant points from the ranked sample values. 


1.2 GIBBS SAMPLING 


One MCMC algorithm is known as Gibbs sampling’, and involves successive sampling 
from the complete conditional densities 


P(Ox\y, 91, ... Ox-1, 0k [ia cesses 8,) 


which condition on both the data and the other parameters. Such successive samples 
may involve simple sampling from standard densities (gamma, Normal, Student f¢, etc.) 
or sampling from non-standard densities. If the full conditionals are non-standard but 
of a certain mathematical form (log-concave), then adaptive rejection sampling (Gilks 
and Wild, 1992) may be used within the Gibbs sampling for those parameters. In other 
cases, alternative schemes based on the Metropolis—Hastings algorithm, may be used to 
sample from non-standard densities (Morgan, 2000). The program WINBUGS may be 
applied with some or all parameters sampled from formally coded conditional densities; 


* This is the default algorithm in BUGS. 
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however, provided with prior and likelihood WINBUGS will infer the correct condi- 
tional densities using directed acyclic graphs”. 

In some instances, the full conditionals may be converted to simpler forms by 
introducing latent data w;, either continuous or discrete (this is known as ‘data aug- 
mentation’). An example is the approach of Albert and Chib (1993) to the probit model 
for binary data, where continuous latent variables w; underlie the observed binary 
outcome y;. Thus the formulation 


Wi = Bx; Ui with uUi ~ N(0, 1) 
yi = I(w; > 0) 


is equivalent to the probit model®. Latent data are also useful for simplifying survival 
models where the missing failure times of censored cases are latent variables (see Example 
1.2 and Chapter 9), and in discrete mixture regressions, where the latent categorical 
variable for each case is the group indicator specifying to which that case belongs. 


1.2.1 Multiparameter model for Poisson data 


As an example of a multi-parameter problem, consider Poisson data y; with means )\j, 
which are themselves drawn from a higher stage density. This is an example of a mixture 
of densities which might be used if the data were overdispersed in relation to Poisson 
assumptions. For instance, if the à; are gamma then the y; follow a marginal density 
which is negative binomial. Suppose the à; are drawn from a Gamma density with 
parameters a and B, which are themselves unknown parameters (known as hyperpara- 
meters). So 


yi ~ Poi(rj) 
File, B) = A$ te PNB? /T (0) 


Suppose the prior densities assumed for «a and B are, respectively, an exponential’ with 
parameter a and a gamma with parameters {b, c}, so that 


5 Estimation via BUGS involves checking the syntax of the program code (which is enclosed in a model file), 
reading in the data, and then compiling. Each statement involves either a relation ~ (meaning distributed as) 
which corresponds to solid arrows in a directed acyclic graph, or a deterministic relation <- which corresponds 
to a hollow arrow in the DAG. Model checking, data input and compilation involve the model menu in 
WINBUGS- though models may also be constructed directly by graphical means. The number of chains (if in 
excess of one) needs to be specified before compilation. If the compilation is successful the initial parameter 
value file or files (‘inits files’) are read in. If, say, three parallel chains are being run three inits files are needed. 
Syntax checking involves highlighting the entire model code, or just the first few letters of the word model, and 
then choosing the sequence model/specification/check model. To load a data file either the whole file is 
highlighted or just the first few letters of the word ‘list’. For ascii data files the first few letters of the first 
vector name need to be highlighted. Several separate data files may be read in if needed. After compilation the 
inits file (or files) need not necessarily contain initial values for all the parameters and some may be randomly 
generated from the priors using ‘gen inits’. Sometimes doing this may produce aberrant values which lead to 
numerical overflow, and generating inits is generally excluded for precision parameters. An expert system 
chooses the sampling method, opting for standard Gibbs sampling if conjugacy is identified, and for adaptive 
rejection sampling (Gilks and Wild, 1992) for non-conjugate problems with log-concave sampling densities. 
For non-conjugate problems without log-concavity, Metropolis—Hastings updating is used, either slice sam- 
pling (Neal, 1997) or adaptive sampling (Gilks et al., 1998). To monitor parameters (i.e. obtain estimates from 
averaging over sampled values) go inference/samples and enter the relevant parameter name. For parameters 
which would require extensive storage to be monitored fully an abbreviated summary (for say the model means 
of all observations in large samples, as required for subsequent calculation of model fit formulas) is obtained 
by inference/summary and then entering the relevant parameter name. 

€ J(u) is 1 if u holds and zero otherwise. 

7 The exponential density with parameter 0 is equivalent to the gamma density G(0, 1). 
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a ~ E(a) 
B ~ G(b, c) 


where a, b and c are taken as constants with known values (or briefly ‘taken as known’). 
Then the posterior density of 0 = (Ni. . Nn, &, B) is 


fM... An, Q, Bly,- -- Yn) ax 


eB b-1 eB f B°/T(a)]" Tes TPs IQ (1.3) 


If elements of this density which do not involve the à; are regarded as constants, it 
can be seen that the conditional density of the à; is a gamma with parameters y; a 
and B 1. Similarly, disregarding elements not functions of B, the conditional 
density of B is gamma with parametersb naandc dj. The full conditional density 
of a is 


n a-l 
Faly, B, A) œ e “1B /T(a)]" (11 ~) 
l 


This density is non-standard but log-concave (see George et al., 1993). Adaptive rejec- 
tion sampling might then be used, and this is the default in BUGS, for example. Another 
option is to establish a grid of probability values according to possible values 
aj(j=1,...J) of a; this is described as ‘griddy Gibbs’ by Tanner (1993). At each 
iteration the densities at each value of a are calculated, namely 


y= e~“ [BY /T(a,)]" (É N J 


and then scaled to sum to 1, with the choice among possible values a; decided by a 
categorical indicator. In practice, a preliminary run might be used to ascertain the 
support for a, namely the range of values across which its density is significant, and 
so define a reasonable grid a;, j= 1,..J. 

If the Poisson counts (e.g. deaths, component failures) are based on different expos- 
ures E; (populations, operating time), then 


yi ~ Poi(E;yj) 


aj—1 


and the posterior density in Equation (1.3) is revised to 


fai, Paa An, Q, Bly. a Vn) x 
e @BP-1e- BIB /T(a)]" II eTEN II yale Ba 
i=] i=l 


(Note that E; raised to the power y; drops out as a constant.) Then the conditional 
density of the à; is a gamma with parameters a y; and B &;. The conditional 
densities of a and B are as above. 


Example 1.1 Consider the power pumps failure data of Gaver and O’Muircheartaigh 
(1987), where failures y; of the ith pump, operating for time £; are Poisson with 


yi ~ Poi(k;) 
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where k; = E;\;. The data are as follows: 


x 


Pump Ei 


1 94.5 
2 15.7 
3 62.9 
4 126 

5 5.24 
6 31.4 
7 

8 

9 

0 


— 


— 


1.05 

1.05 

2.1 
10.5 


NBR RK OWBRUNRK YN 


= 
N 


The BUGS coding for direct sampling from the full conditionals including the grid prior 
on a is as follows: 


model {for (i in 1 : n) { lambda[i] ~ dgamma(A[i],B[il]) 
A[i] <- alpha y[i]; B[i] <- beta E[i] 
log.lam[i] <- log(lambda[i])} 
# product lambda[1]*lambda[2] . . .*lambda[N] obtained by exponentiating sum 
# of log lambda’s 
Prod. lam <- exp(sum(log.lam[])) 
for (j in 1:50) {alph[j] <- j*0.05; 
# grid based on conditional posterior of alpha; prior of alpha an E(1) 
G[j] <- exp(-alph[j)*pow(pow(beta,alph[j])/exp (loggam(alph[j])),N) 
* pow(Prod.lam,alph[j]—1) 
# scale grid probabilities to sum to 1 
G.Pr[j] <- G[j]/sum(G[])} 
# sample from grid (discrete prior) 
k.alph ~ dcat (G.Pr[]) 
alpha <- alph[k.alph] 
C <- n“alpha 0.1 
D<-1 sum(lambda[]) 
beta ~ dgamma(C, D)} 


The grid for a ranges between 0.05 and 2.5. This coding adopts the priors used by 
George et al. (1993), namely, an exponential density with known parameter 1 for a, and 
a Gamma(0.1, 1) prior for B. The inits file just contains initial values for a and B, while 
those of à; may be generated using ‘gen inits’. Then the posterior means and standard 
deviations of a and B, from a single long chain of 50000 iterations with 5000 burn in, 
are 0.70 (0.27) and 0.94 (0.54). 

However, this coding may be avoided by specifying just the priors and likelihood, as 
follows: 


model {for (iini:n) { 
lambda[i] ~ dgamma(alpha, beta) 
kappa[i] <- lambda[i]* E[i] 
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y[i] ~ dpois(kappa[i}} 
alpha ~ dexp(1) 
beta ~ dgamma (0.1, 1.0)} 


1.2.2 Survival data with latent observations 


As a second example, consider survival data assumed to follow a Normal density — note 
that usually survival data are non-Normal. Survival data, and more generally event 
history data, provide the most familiar examples of censoring. This occurs if at the 
termination of observation, certain subjects are (right) censored in that they have yet to 
undergo the event (e.g. a clinical end-point), and their unknown duration or survival 
time ź is therefore not observed. Instead, the observation is the time ¢* at which 
observation of the process ceased, and for censored cases, it must be the case that 
t > t*. The unknown survival times for censored subjects provide additional unknowns 
(as augmented or latent data) to be estimated. 

For the Normal density, the unknown distributional parameters are the mean and 
variance {1, o°}. In Bayesian modelling there are potential simplifications in consider- 
ing the specification of the prior, and updating to the posterior, in terms of the inverse 
of the variance, or precision, t= o~*. Since both the variance and precision are 
necessarily positive, an appropriate prior density is constrained to positive values. 
Though improper reference priors for the variance or precision are often used, 
consider prior densities P(t), which are proper in the sense that the integral 
over possible values is defined. These include the uniform density over a finite range, 
such as 


Tt ~ U(O, 1000) 
or a gamma density which allows for various types of skewness. This has the form 
T~ G(f, 8) 
so that 
P(t) x 1%! exp(— gt) (1.4) 


where f and g are taken as known constants, and where the prior mean of 7 is then f/g 
with variance f /g°. For instance, taking f = 1, g = 0.001 gives a prior on t which still 
integrates to 1 (is proper) but is quite diffuse in the sense of not favouring any value. A 
similar diffuse prior takes f = g = 0.001 or some other common small value. Substi- 
tuting f = 1 and g = 0.001 in Equation (1.4) shows that for these values of f and g the 
prior in (1.4) is approximately (but not quite) 


P(t) x 1 (1.5) 
Setting g = 0 is an example of an improper prior, since then P(t) « 1, and 


[Poar = 


So taking f = 1, g = 0.001 in Equation (1.4) represents a ‘just proper’ prior. 
In fact, improper priors are not necessarily inadmissible for drawing valid inferences 
providing the posterior density, given by the product of prior and likelihood, as in 


8 Tn this case, the prior on 7 is approximately P(t) « 1/7. 
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Equation (1.1), remains proper (Fraser et al., 1997). Certain improper priors may 
qualify as reference priors, in that they provide minimal information (for example, 
that the variance or precision is positive), still lead to proper posterior densities, and 
also have valuable analytic properties, such as invariance under transformation. An 
example is the Jeffreys prior for o = 7°, namely 


P(o) = 1/0 


In BUGS, priors in this form may be implemented over a finite range using a discrete 
grid method and then scaling the probabilities to sum to 1. This preserves the shape 
implications of the prior, though obviously they are no longer improper. 


1.2.3 Natural conjugate prior 


In a model with constant mean p over all cases, a joint prior for {u, T}, known as the 
‘natural conjugate’ prior, may be specified for Normally or Student ¢ distributed data 
which assumes a Gamma form for t, and a conditional prior distribution for p given T 
which is Normal. Thus, the prior takes the form 


P(u, 7) = P(t) P(u|7) 


One way to specify the prior for the precision r is in terms of a prior ‘guess’ at the 
variance Vo and a prior sample size v (possibly non-integer) which represents the 
strength of belief (usually slight) in this guess. Typical values are v = 2 or lower. 
Then the prior for t takes the form 


t ~ G(v/2, vVo/2) (1.6) 
and taking Vp = 0.001 and v = 2, gives a ‘just proper’ prior 
t ~ GC, 0.001) 


as discussed earlier. Given the values of t drawn from this prior, the prior for p takes the 
form 


u ~ N(Mp, (197) ') 


where Mp is a prior guess at the unknown mean of the Normal density. Since higher 
values of the precision nọt mean lower variances, it can be seen that higher values of no 
imply greater confidence in this guess. Usually, no is taken small (1 or less, as for v). So 
the entire prior has the form 


P(u, 7) x 7°? exp { — 0.5noT(u — Mo)? } exp (— 0.5vVot)1°>""! 


1.2.4 Posterior density with Normal survival data 


In the survival example, suppose initially there is only one group of survival times, and 
that all times are known (i.e. there is no censoring). Let the observed mean survival 


time be 
ee 
M=n">) s 


and observed variance be 
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V = X(t; -— MP /™—1) 


Then one may show that the posterior density of {p,7} given data {t1,f2,.... tn} 1s 
proportional to the product of 


1. A Normal density for u with precision nıt, where nı = no n, and with mean M,, 
which is a weighted average of data and prior means, namely Mı = wọoMoọo wiM, 
with weights wo = no/m and w; = n/n; and 

2. A Gamma density for t of the form in Equation (1.6) which has ‘sample size’ 
vj} =v nand variance 


Vi=( ny'[Vov mM? (n—1)V nM? —71,M?] 


Thus 
P(t, ply) x °° exp[ — 0.5714 (u — M,)"] 7°"! exp (— 0.5771 V1) 


The Gibbs sampling approach considers the distributions for t and p conditional on 
the data and the just sampled value of the other. The full conditional for u (regarding + 
as a constant) can be seen to be a Normal with mean M; and precision tn. Then just 
having drawn p at iteration ¢, the next iteration samples from the full conditional 
density for t which is a gamma density with shape (first parameter) 0.5v, 0.5 and 
scale 0.5[vı V] (yw — M°]. 

If some event times were in fact censored when observation ceased, then these are 
extra parameters drawn from the Normal density with mean p and precision 7 subject to 
being at least ¢*. That is, constrained sampling from the Normal above (with mean p 
and precision 7 at the tth iteration) is used, disregarding sampled values which are 
lower than t*. The subsequently updated values of M and V include these imputations 
as well as the uncensored {;. 

It can be seen that even for a relatively standard problem, namely updating the 
parameters of a Normal density, the direct coding in terms of full conditional densities 
becomes quite complex. The advantage with BUGS is that it is only necessary to specify 
the priors 


t ~ G(v/2, Vov/2) 
u ~ N(Mo, (not) ") 
and the form of the likelihood for the data, namely 
ti~ N(p, T7!) (t uncensored) 
ti ~ N(w, T7!) I(t;*,) (censoring at t*) 


and the full conditionals are inferred. The /(a, b) symbol denotes a range within which 
sampling is confined. For uncensored data the t; are observed, but for the censored 
data the observations are ¢;* and the true t; are latent (or ‘augmented’, or ‘missing’) 
data. 


Example 1.2 Leukaemia remission times Consider the frequently analysed data of 
Gehan (1965) on leukaemia remission times under two therapies, the latter denoted 77[ ] 
in the code below with Tr[i] = 1 for the new treatment. Here delayed remission (longer 
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survival) indicates a better clinical outcome. There is extensive censoring of times under 
the new therapy, with censored times coded as NA, and sampled to have minimum 
defined by the censored remission time. 


Assume independent Normal densities differing in mean and variance according to 
treatment, and priors 


w; ~ N(0, T7’) with m = 1 
q; ~ G(1, 0.001) 


in treatment groups j. Then the code for BUGS (which parameterises the Normal with 
the inverse variance) and with the data from Gehan, is as follows: 


model { for (i in 1:42) { t[i] ~ dnorm(mu[Tr[i]], tau[Tr[i]]) I(min[i], )} 
for (j in 1:2) {mu[j] ~ dnorm(0, tau[j]) 
# inverse variances 
tau[j] ~ dgamma(1,0.001)} 


with data file 
list(t=c(NA,6,6,6,7,NA, NA,10,NA,13,16,NA,NA,NA,22,23,NA,NA,NA, 
NA,NA,1,1, 2,2,3,4,4,5,5,8,8,8,8,11,11,12,12,15,17,22,23), 
min=c(6,0,0,0,0,9,10,0,11,0,0,17,19,20,0,0,25,32,32,34,35,0,0, 
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0),Tr=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 
1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)) 
The inits file 
list(mu=c(10,10),tau=c(1,1)) 
specifies the initial Normal parameters only, and the missing survival times (the 
remaining unknowns) may be sampled from the prior (using “gen inits’). The average 
remission times under new and old therapies are 24.8 and 8.3 months. An example of a 
skew posterior density is provided by the imputed survival times for censored subjects. 
While an unspecified upper sampling limit for censored times is one option, there may 
be subject matter considerations ruling out unusually high values (e.g. survival 
exceeding five years). Since setting nọ = 1 (or any other value) may appear arbitrary, 
one may also assume independent priors for w, and q; (Gelman et al., 1995, Chapter 3), 
such as p; ~ N(0, Kj), with K; typically large, and 1; ~ G(1,0.001) as before. With 
K; = 1000, the average remission times become 25.4 and 8.6. 


1.3 SIMULATING RANDOM VARIABLES FROM STANDARD DENSITIES 


Parameter estimation by MCMC methods and other sampling-based techniques re- 
quires simulated values of random variables from a range of densities. As pointed out by 
Morgan (2000), sampling from the uniform density U(0, 1) is the building block for 
sampling the more complex densities; in BUGS 0, 1 this involves the code 


U ~ dunif(0, 1) 
Thus, the Normal univariate density’ is characterised by a mean p and variance p, with 
? BUGS parameterises the Normal in terms of the inverse variance, so priors are specified on P = 7! and p, 


and samples of @ may be obtained by specifying @ = P~!. With typical priors on p and P, this involves the 
coding 
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X ~ N(p, >) 


A sample from Normal density with mean 0 and variance 1 may be obtained by 
considering two independent draws U; and U from a U(0, 1) density. Then with 
m ~ 3.1416, the pair 


Zı =[—2 In (U1)? sin (2r U2) 
Z =[—2 In(Uj)]°° cos (2m U2) 


are independent draws from an N(0,1) density. Then using either of these draws (say 
Z = Z), a sample from N(p, p) is obtained via 


X=p Zy 


An approximately Normal N(0, 1) variable may also be obtained using central limit 
theorem ideas: take n draws U1, U2, .. U, from a U(0, 1) then 


X= (= TE osn) (2) 


is approximately N(0, 1) for large n. In fact n = 12 is often large enough and simplifies 
the form of X. 


1.3.1 Binomial and negative binomial 


Another simple application of sampling from the uniform U(0, 1) is if a sample of an 
outcome Y; (either 0 or 1) from a Bernoulli density with probability p is required. Thus, 
if U; is a sample from U(0, 1) and U; < p then Y; is taken as 1, whereas U; > p leads to 
Y; = 0. So the unit interval is in effect split into sections of length p and 1 — p. This 
principle can be extending to simulating ‘success’ counts r from a binomial with n 
subjects at risk of an event with probability p. The sampling from U(0, 1) is repeated 
n times and the number of times for which U; < p is the simulated success count. 
Similarly, consider the negative binomial density, with 


prix) = ("7 a-o x=rr lr 2.. 


In this case a sequence U4, U2, ... may be drawn from the U(0,1) density until r of them 
are less than or equal to p, with x given by the number of draws U; needed to reach this 
threshold. 


1.3.2 Inversion method 


A further fundamental building block based on the uniform density follows from the 
fact that if U; is a draw from U(0, 1) then 


X; = —1/pIn(U)) 
is a draw from an exponential’ with mean p. The exponential density is defined by 
phi <- 1/P 
X ~ dnorm (mu, P) 


mu ~ dnorm(0,0.001) 
P ~ dgamma(0.001,0.001) . 


10 In BUGS the appropriate code is x ~dexp(mu). 


14 BAYESIAN MODEL ESTIMATION VIA REPEATED SAMPLING 


F(x) = wexp (— px) 


with mean 1/ and variance 1/7, and is often a baseline model for waiting or inter- 
event times. 

This way of sampling the exponential is an example of the inversion method for 
simulation of a continuous variable with distribution function F, the inverse of which is 
readily available. If u ~ U (0, 1) then 


Pr[F-!(u) < x] = Pr[u < F(x)] = F(x) 


and the quantities x = F~'(u) are then draws from a random variable with cumulative 
density F(x). The same principle may be used to obtain draws from a logistic distribu- 
tion x ~ Logistic(w, 7), a heavy tailed density (as compared to the Normal) with cdf 


F(x) =1/{1 eae 
and pdf 
f(x) = ree) IT] ete wp 


This distribution has mean p, variance n?/(31?), and a draw may be obtained by the 
transformation 


x = log, (U/[1 — UJ)/7 — p. 
The Pareto, with density 
fd) = ab*/pf t] x>b>0 
may be obtained as 
x =b/(1—vu)!/4 
or equivalently, 


x= b/ue, 


1.3.3 Further uses of exponential samples 


Simulating a draw x from a Poisson with mean pu can be achieved by sampling 
U; ~ U(0, 1) and taking x as the maximum n for which the cumulative sum of 
Li = -ln (U;), 


S= D nh 


remains below p. From above, the L; are exponential with rate 1, and so viewed as inter- 
event times of a Poisson process with rate 1, N = N(w) equals the number of events 
which have occurred by time p. Equivalently, x is given by n, wheren 1 draws from an 
exponential density with parameter u are required for the sum of the draws to first 
exceed 1. 

The Weibull density is a generalisation of the exponential also useful in event history 
analysis. Thus, if £ ~ Weib(a, A), then 


f(t) =ar exp(— A), t>0 
If x is exponential with rate \, then ż = x!/“ is Weib(a, A). Thus in BUGS the codings 
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ti] ~ dweib(alpha,lambda) 
and 

x[i] ~ dexp(lambda) 

ti] <- pow(zļ{i],1/alpha) 
generate the same density. 


1.3.4 Gamma, chi-square and beta densities 


The gamma density is central to the modelling of variances in Bayesian analysis, and as 
a prior for the Poisson mean. It has the form 


f(x) = [B*/T@]x exp(— Bx), x>0 


with mean a/® and variance a/ß?. Several schemes are available for generating 
a gamma variate. For a= K an integer, drawing a sample x1, x2,.. Xg from an 
exponential with mean B and then taking the sum y= X¥!x; generates a draw 
from a Gamma density. Note also that if x ~ G(a, 1), then y = x/B is a G(a, B) 
variable. 

Since a G(a, B) density is often used as a prior for a precision parameter, it is also 
worth noting that it follows that the variance then follows an inverse gamma density, 
with the same parameters. The inverse gamma has the form 


f(x) = [B*/r(@]x *" exp(— B/x), x>0 


with mean B/(a — 1) and variance 87/[(a — 1) (a — 2)]. One possibility (for x approxi- 
mately Normal) for setting a prior on the variance is to take the prior mean of g? to be 
the square of one sixth of the anticipated range of x (since the range is approximately 60 
for a Normal variable). Then for a = 2 (or just exceeding 2 to ensure finite variance), 
one might set B = (range/6)”. 

From the gamma density may be derived a number of other densities, and hence 
ways of sampling from them. The chi-square is also used as a prior for the variance, 
and is the same as a gamma density with a = v/2, B = 0.5. Its expectation is then 
v, usually interpreted as a degrees of freedom parameter. The density (1.6) above 
is sometimes known as a scaled chi-square. The chi-square may also be obtained for 
v an integer, by taking v draws x1, x2,...Xy from an N(0,1) density and taking the sum of 
XT, X3,..x2. This sum is chi-square with v degrees of freedom. 

The beta density is used as a prior for the probability p in the binomial density, and 
can accommodate various degrees of left and right skewness. It has the form 


f(p)=T@ P/E ad- a, B>0;0<p<1 
with mean a/(a 8). Setting a = B implies a symmetrical density with mean 0.5, 
whereas a > B implies positive skewness and a < B implies negative skewness. The 
totala  —2 defines a prior sample size as in Equation (1.2). 


If y and x are gamma densities with equal scale parameters (say w = 1), and if 
y ~ Gla, w) and x ~ G(B, œ), then 


x=y/(y x) 


is a B(a, B) density. The beta has mean a/(a B) and variance aß/[(a 8B) 
(a B D) 
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1.3.5 Univariate and Multivariate t 


For continuous data, the Student ¢ density is a heavy tailed alternative to the Normal, 
though still symmetric, and is more robust to outlier points. The heaviness of the tails is 
governed by an additional degrees of freedom parameter v as compared to the Normal 
density. It has the form 


f(x) =T(0.5v — 0.5)/[P(0.5v)(o7v7)"9] [L(x — p)?/(vo?) D 


with mean p and variance vo?/(v — 2). If z is a draw from a standard Normal, N(0, 1), 
and y is a draw from the Gamma G(0.5v, 0.5) density, then x= u øozyv/vy is a draw 
form a Student ¢,(y, o°). Equivalently, let y be a draw from a Gamma density, 
G(0.5v,0.5v), then the Student ¢ is obtained by sampling from N(, o?/y). The latter 
scheme is the best form for generating the scale mixture version of the Student ¢ density 
(see Chapter 2). 

A similar relationship holds between the multivariate Normal and multivariate t 
densities. Let x be a d-dimensional continuous outcome. Suppose x is multivariate 
Normal with mean p = (py, bo,-- yg) and d x d dispersion matrix V. This is denoted 
x~ MVN(p, V) or x~ Nalu, V), and 


f(x) = Qn) P yv exp[ — 0.5(X — p) VEX — p)] 


Sampling from this density involves the Cholesky decomposition!’ of V, namely 
V = AAF", where A is also d x d. Then if z1, Z2, .. Za are independent univariate draws 
from a standard Normal, 


x=pu Az 


is a draw from the multivariate Normal. The multivariate Student ¢ with v degrees of 
freedom, mean p = (M1, M2,- - pg) and dispersion matrix V is defined by 


F@)= KIVA Aya — WV — wpe ® 


where K is a constant ensuring the integrated density sums to unity. This density is 
useful for multivariate data with outliers or other sources of heavy tails, and may be 
sampled from by taking a single draw Y from a Gamma density, A ~ G(0.5v, 0.5v) and 
then sampling the vector 


x~ Nale, V/A) 


The Wishart density, the multivariate generalisation’? of the gamma or of the chi-square 
density, is the most common prior structure assumed for the inverse of the dispersion 


"| This matrix may be obtained (following an initialisation of A) as: 
for i= l to d 
forj=1toi-1 


k=1 


jul 
Aj = (% a Sa / Ay 


i-l 0.5 
Ai = (r = 554) 
k=1 


12 Different parameterisations are possible. The form in WINBUGS generalises the chi-square. 
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matrix V, namely the precision matrix T = V~!. One form for this density, for a degrees 
of freedom v > d, and a scale matrix S 


T x (S|? |T/°°°-© exp ( — 0.5tr[ST]) 


The matrix S/v is a prior guess at the dispersion matrix, since E(T) = vS!. 


1.3.6 Densities relevant to multinomial data 


The multivariate generalisation of the Bernoulli and binomial densities allows for 
a choice among C > 2 categories, with probabilities 7, m2,.. mc summing to 1. 
In BUGS the multivariate generalisation of the Bernoulli may be sampled from in 
two ways: 


Y[i] ~ deat(pi[1: C]) 
which generates a choice j between 1 and C, or 
Z[i] ~ dmulti(pi[1: C], 1) 


This generates a choice indicator, Z; = 1 if the jth category is chosen, and Z; = 0 
otherwise. 
For example, the code 
{for (i in 1:100) {Y[i] ~ deat(pi[1:3])}} 
with data in the list file 
list( pi=c(0.8,0.1,0.1)} 
would on average generate 80 one’s, 10 two’s and 10 three’s. The coding 
{for (i in 1:100) £Y[i,1:3] ~ dmulti(pi[1:3],1)}} 
with data as above would generate a 100 x 3 matrix, with each row containing a one 
and two zeroes, and the first column of each row being 1 for 8 out of 10 times on 
average. 
A commonly used prior for the probability vector [] = (m1, . . mc) is provided by the 
Dirichlet density. This is a multivariate generalisation of the beta density, as can be seen 
from its density 


f(m,.., 7c) =T(a Q2 ae ac)/(P(a)l(e2)...T(ac)] 
eae al “es one 


where the parameters a1, @2,..a@c¢ are positive. The Dirichlet may be drawn from 
directly in WINBUGS and a common default option sets a; = a2 = ...ac = 1. How- 
ever, an alternative way of generation is sometimes useful. Thus, if Z|, Z2,..Zc are 
gamma densities with equal scale parameters (say w = 1), and if 


Z| ~ Ga, œ), Z2 ~ Gaz, w), .. Zc ~ Glac, w) 


then the quantities 
Z; =0;/_ ox FH hC: 
k 


are draws from the Dirichlet with prior weights vector (a1, .. ac). 
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1.4 MONITORING MCMC CHAINS AND ASSESSING CONVERGENCE 


An important practical issue involves assessment of convergence of the sampling 
process used to estimate parameters, or more precisely update their densities. In con- 
trast to convergence of optimising algorithms (maximum likelihood or minimum least 
squares, say), convergence here is used in the sense of convergence to a density rather 
than single point. The limiting or equilibrium distribution P(6| Y) is known as the target 
density. The sample space is then the multidimensional density in p-space; for instance, 
if p = 2 this density may be approximately an ellipse in shape. 

The above two worked examples involved single chains, but it is preferable in 
achieving convergence to use two or more parallel chains'* to ensure a complete 
coverage of this sample space, and lessen the chance that the sampling will become 
trapped in a relatively small region. Single long runs may, however, often be adequate 
for relatively straightforward problems, or as a preliminary to obtain inputs to multiple 
chains. 

A run with multiple chains requires overdispersed starting values, and these might be 
obtained from a preliminary single chain run; for example, one might take the 1st and 
99th percentiles of parameters from a trial run as initial values in a two chain run (Bray, 
2002), or the posterior means from a trial run combined with null starting values. 
Another option might combine parameters obtained as a random draw" from a trial 
run with null parameters. Null starting values might be zeroes for regression param- 
eters, one for precisions, and identity matrices for precision matrices. Note that not all 
parameters need necessarily be initialised, and parameters may instead be initialised by 
generating’ from their priors. 

A technique often useful to aid convergence, is the over-relaxation method of Neal 
(1998). This involves generates multiple samples of each parameter at the next iteration 
and then choosing the one that is least correlated with the current value, so potentially 
reducing the tendency for sampling to become trapped in a highly correlated random 
walk. 


1.4.1 Convergence diagnostics 


Convergence for multiple chains may be assessed using the Gelman-Rubin scale reduc- 
tion factors, which are included in WINBUGS, whereas single chain diagnostics require 
use of the CODA or BOA packages’® in Splus or R. The scale reduction factors 
compare variation in the sampled parameter values within and between chains. If 
parameter samples are taken from a complex or poorly identified model then a wide 
divergence in the sample paths between different chains will be apparent (e.g. Gelman, 
1996, Figure 8.1) and variability of sampled parameter values between chains will 
considerably exceed the variability within any one chain. Therefore, define 


p= sa" z B) AT —1) 


13 In WINBUGS this involves having separate inits files for each chain and changing the number of chains 
from the default value of 1 before compiling. 

'4 For example, by using the state space command in WINBUGS. 

15 This involves ‘gen ints’ in WINBUGS. 

16 Details of these options and relevant internet sites are available on the main BUGS site. 
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as the variability of the samples oP within the jth chain (j = 1, ... J). This is assessed 
over T iterations after a burn in of s iterations. An overall estimate of variability within 
chains is the average Vw of the V;. Let the average of the chain means 8; be denoted 8.. 
Then the between chain variance is 


Ta. <0 
Vg=——) 0; — 0. 
B Fa ) 


The Scale Reduction Factor (SRF) compares a pooled estimator of var(8), given by 
Vp =Vg/T TVw/(T-1) 
with the within sample estimate Vw. Specifically, the SRF is 
(Ve/Vw)” 


and values of the SRF, or ‘Gelman-Rubin statistic’, under 1.2 indicate approximate 
convergence. 

The analysis of sampled values from a single MCMC chain or parallel chains may 
be seen as an application of time series methods (see Chapter 5) in regard to problems 
such as assessing stationarity in an autocorrelated sequence. Thus, the autocorrelation 
at lags 1, 2, and so on, may be assessed from the original series of sampled values 0, 
Bt D, oC d- or from more widely spaced sub-samples K steps apart 0, 0 9, 6€ 24), 
Geweke (1992) developed a t-test applicable to assessing convergence in runs of sampled 
parameter values, both in single and multiple chain situations. Let 6, be the posterior 
mean of scalar parameter 6 from the first n, iterations in a chain (after burn-in), and 6, 
be the mean from the last n, draws. If there is a substantial run of intervening iterations, 
then the two samples should be independent. Let V, and V, be the variances of these 
averages'’. Then the statistic 


Z = (a —95)/(Va Vn)?” 


should be approximately N(0, 1). This test may be obtained in CODA or the BOA 
package. 


1.4.2 Model identifiability 


Problems of convergence of MCMC sampling procedures may reflect problems in 
model identifiability due to over-fitting or redundant parameters. Use of diffuse priors 
increases the chances of a poorly identified model, especially in complex hierarchical 
models (Gelfand and Sahu, 1999), and elicitation of more informative priors may assist 
identification and convergence. Slow convergence will show in poor ‘mixing’ with high 
autocorrelation in the successive sampled values of parameters, apparent graphically in 
trace plots that wander rather than rapidly fluctuating around a stable mean. 


"7 T¢ by chance the successive samples 0, t=1,..ng and oV, r = 1,..n, were independent, then V, and V, 
would be obtained as the population variance of the 0, namely V (0), divided by na and ny. In practice, 
dependence in the sampled values is likely, and V, and V, must be estimated by allowing for the autocorrela- 
tion. Thus 


Va = (1/na) ho $ Yy (| 


= 


where y; is the autocovariance at lag j. In practice, only a few lags may be needed. 
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Conversely, running multiple chains often assists in diagnosing poor identifiability of 
models. Examples might include random effects in nested models, for instance 


yy=e Ne wy = 1,..G7=1,..m (1.7) 


where n; ~ N(0, o2), uj ~ N(0, o7). Poor mixing may occur because the mean of the n; 
and the global mean p are confounded: a constant may be added to the n; and 
subtracted from p without altering the likelihood (Gilks and Roberts, 1996). Vines, 
Gilks and Wild (1996) suggest the transformation (or reparameterisation) in Equation 
(1.7), 


ven ia = 
leading to the model 
Yij =v Qi Uij 


ay ~ NỌ, (m — 1)o} /m) 
i m-j 4 
aj ~ n(- 2 Qk, hei) 
m—l1 
Am == Xoo 
k=1 


More complex examples occur in a spatial disease model with unstructured and spatially 
structured errors (Gelfand et al., 1998), sometimes known as a spatial convolution 
model and considered in Example 1.3 below, and in a particular kind of multiple 
random effects model, the age-period-cohort model (Knorr-Held and Rainer, 2001). 
Identifiability issues also occur in discrete mixture regressions (Chapter 3) and struc- 
tural equation models (Chapter 8) due to label switching during the MCMC sampling. 
Such instances of non-identifiability will show as essentially nonconvergent parameter 
series between chains, whereas simple constraints on parameters will typically achieve 
identifiability. For example, if a structural equation model involved a latent construct 
such as alienation and loadings on this construct were not suitably constrained, then 
one chain might fluctuate around a loading of —0.8 on social integration (the obverse of 
alienation) and another chain fluctuate around a loading of 0.8 on alienation. 

Correlation between parameters within the parameter set 8 = (0), 02, ...8,), such as 
between 8; and 62, also tends to delay convergence and to increase the dependence 
between successive iterations. Re-parameterisation to reduce correlation — such as 
centring predictor variables in regression — may improve convergence (Gelfand et al., 
1995; Zuur et al., 2002). In nonlinear regressions, a log transform of a parameter may be 
better identified than its original form (see Chapter 10 for examples in dose-response 
modelling). 


1.5 MODEL ASSESSMENT AND SENSITIVITY 


Having achieved convergence with a suitably identified model a number of processes 
may be required to firmly establish the models credibility. These include model choice 
(or possibly model averaging), model checks (e.g. with regard to possible outliers) and, 
in a Bayesian analysis, an assessment of the relation of posterior inferences to prior 
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assumptions. For example, with small samples of data or with models where the random 
effects are to some extent identified by the prior on them, there is likely to be sensitivity 
in posterior estimates and inferences to the prior assumed for parameters. There may 
also be sensitivity if an informative prior based on accumulated knowledge is adopted. 


1.5.1 Sensitivity on priors 


One strategy is to consider a limited range of alternative priors and assess changes in 
inferences; this is known as ‘informal’ sensitivity analysis (Gustafson, 1996). One might 
also consider more formal approaches to robustness based perhaps on non-parametric 
priors (such as the Dirichlet process prior) or on mixture (‘contamination’) priors. For 
instance, one might assume a two group mixture with larger probability 1 — p on the 
‘main’ prior 7;(8), and a smaller probability such as p = 0.2 on a contaminating density 
m2(0), which may be any density (Gustafson, 1996; Berger, 1990). One might consider 
the contaminating prior to be a flat reference prior, or one allowing for shifts in the 
main prior’s assumed parameter values (Berger, 1990). For instance, if 7,(8) is N(0, 1), 
one might take 7(8) ~ N(m, v2), where higher stage priors set ma ~ U( — 0.5, 0.5) and 
v2 ~ U(0.7, 1.3). 

In large datasets, regression parameters may be robust to changes in prior unless 
priors are heavily informative. However, robustness may depend on the type of param- 
eter and variance parameters in random effects models may be more problematic, 
especially in hierarchical models, where different types of random effect coexist in a 
model (Daniels, 1999; Gelfand et al., 1998). While a strategy of adopting just proper 
priors on variances (or precisions) is often advocated in terms of letting the data speak 
for themselves (e.g. gamma(a, a) priors on precisions with a = 0.001 or a = 0.0001), this 
may cause slow convergence and relatively weak identifiability, and there may be 
sensitivity in inferences between analyses using different supposedly vague priors (Kel- 
sall and Wakefield, 1999). One might introduce stronger priors favouring particular 
values more than others (e.g. a gamma(5, 1) prior on a precision), or even data based 
priors loosely based on the observed variability. Molli¢ (1996) suggests such a strategy 
for the spatial convolution model. Alternatively the model might specify that random 
effects and/or their variances interact with each other; this is a form of extra infor- 
mation. 


1.5.2 Model choice and model checks 


Additional forms of model assessment common to both classical and Bayesian methods 
involve measuring the overall fit of the model to the dataset as a basis for model choice, 
and assessing the impact of particular observations on model estimates and/or fit 
measures. Model choice is considered in Chapter 2 and certain further aspects which 
are particularly relevant in regression modelling are discussed in Chapter 3. While 
marginal likelihood, and the Bayes factor based on comparing such likelihoods, defines 
the canonical model choice, in practice (e.g. for complex random effects models or 
models with diffuse priors) this method may be relatively difficult to implement. 
Relatively tractable approaches based on the marginal likelihood principle include 
those of Newton and Raftery (1994) based on the harmonic average of likelihoods, 
the importance sampling method of Gelfand and Dey (1994), as exemplified by Lenk 
and Desarbo (2000), and the method of Chib (1995) based on the marginal likelihood 
identity (Equation (2.4) in Chapter 2). 
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Methods such as cross-validation by single case omission lead to a form of pseudo 
Bayes factor based on multiplying the CPO for model 1 over all cases and comparing 
the result with the same quantity under model 2 (Gelfand, 1996, p. 150). This approach 
when based on actual omission of each case in turn may (with current computing 
technology) be only practical with relatively small samples. Other sorts of partitioning 
of the data into training samples and hold-out (or validation) samples may be applied, 
and are less computationally intensive. 

In subsequent chapters, the main methods of model choice are (a) those based on 
predictive criteria, comparing model predictions z with actual observations'®, as advo- 
cated by Gelfand and Ghosh (1998) and others, and (b) modifications of classical 
deviance tests to reflect the effective model dimension, as in the DIC criterion discussed 
in Chapter 2 (Spiegelhalter et al., 2002). These are admittedly not formal Bayesian 
choice criteria, but are relatively easy to apply over a wide range of models including 
non-conjugate and heavily parameterised models. 

The marginal likelihood approach leads to posterior probabilities or weights on 
different models, which in turn are the basis for parameter estimates derived by model 
averaging (Wasserman, 2000). Model averaging has particular relevance for regression 
models, especially for smaller datasets where competing specifications provide closely 
comparable explanations for the data, and so there is a basis for weighted averages of 
parameters over different models; in larger datasets by contrast, most model choice 
diagnostics tend to overwhelmingly support one model. A form of model averaging also 
occurs under predictor selection methods, such as those of George and McCulloch 
(1993) and Kuo and Mallick (1998), as discussed in Chapter 3. 


1.5.3 Outlier and influence checks 


Outlier and influence analysis in Bayesian modelling may draw in a straightforward 
fashion from classical methods. Thus in a linear regression model with Normal errors 


Vi=Br Bor -Bp 1Xpi i 


the posterior mean of ê; = y; — Bx; compared to its posterior standard deviation pro- 
vides an indication of outlier status (Pettitt and Smith, 1985; Chaloner, 1998) — see 
Example 3.12. In frequentist applications of this regression model, the influence of a 
particular case is apparent in the ratio of Var(é;) = o7(1 — v;) to the overall residual 
variance o°, where v; = x/[X'X | 'x;, with X the nx (p 1) covariate matrix for all 
cases; a similar procedure may be used in Bayesian analysis. 

Alternatively, the CPO predictive quantity f(yi|y;-,) may be used as an outlier 
diagnostic and as the basis for influence measures. Weiss and Cho (1998) consider 
possible divergence criteria in terms of the ratios a; = [CPO;/f(,;|8)], such as the 
Lı norm, with the influence of case i on the totality of model parameters then repre- 


'8 A simple approach to predictive fit generalises the method of Laud and Ibrahim (1995) — see Example 3.2 — 
and is mentioned by Gelfand and Ghosh (1998), Sahu et al. (1997) and Ibrahim et al. (2001). Let y; be the 
observed data, be the parameters, and z; be ‘new’ data sampled from f(z|b). Suppose v; and ç; are the 
posterior mean and variance of z;, then one possible criterion for any w > 0 is 


C= dog ww DID OW ye 
i=1 i=l 


Typical values of w at which to compare models might be w = 1, w = 10 and w = 100, 000. Larger values of w 
put more stress on the match between v; and y; and so downweight precision of predictions. Gelfand and 
Ghosh (1998) develop deviance-based criteria specific for non-Normal outcomes (see Chapter 3), though these 
assume no missingness on the response. 
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sented by d(a;) = 0.5|a; — 1| — see Example 1.4. Specific models, such as those introdu- 
cing latent data, lead to particular types of Bayesian residual (Jackman, 2000). Thus, in 
a binary probit or logit model, underlying the observed binary y are latent continuous 
variables z, confined to negative or positive values according as y is 0 or 1. The 
estimated residual is then z — Êx; analogously to a Normal errors model. 


Example 1.3 Lung cancer in London small areas As an example of the possible 
influence of prior specification on regression coefficients and random effects, consider 
a small area health outcome: female lung cancer deaths y; in the three year period 
1990-92 in 758 London small areas!’ (electoral wards). If we focus first on regression 
effects, there is overwhelming accumulated evidence that ill health and mortality (espe- 
cially lung cancer deaths) are higher in more deprived, lower income areas. Having 
allowed for the impact of age differences via indirect standardisation (to provide 
expected deaths £;) variations in this type of mortality are expected to be positively 
related to a deprivation score x;, which is in standard form (zero mean, variance 1). The 
following model is assumed 


yi ~ Poi(y,) 
W; = Eip; 
log(p) = Bı Box; 


The only parameters, 8, and 85, are assigned diffuse but proper N(0,1000) priors. Since 
the sum of observed and expected deaths is the same and x is standardised, one might 
expect Bı to be near zero. Two sets initial values of adopted B = (0, 0) and B = (0, 0.2) 
with the latter the mean of a trial (single chain) run. A two chain run then shows early 
convergence via Gelman-Rubin criteria (at under 250 iterations) and from iterations 
250-2500 pooled over the chains a 95% credible interval for B, of (0.18,0.24) is 
obtained. 

However, there may well be information which would provide more informative 
priors. Relative risks p; between areas for major causes of death (from chronic disease) 
reflect, albeit imperfectly, gradients in risk for individuals over attributes such as 
income, occupation, health behaviours, household tenure, ethnicity, etc. These gradi- 
ents typically show at most five fold variation between social categories except perhaps 
for risk behaviours directly implicated in causing a disease. Though area contrasts may 
also be related to environmental influences (usually less strongly) accumulated evidence, 
including evidence for London wards, suggests that extreme relative contrasts in stand- 
ard mortality ratios (100 x p,) between areas are unlikely to exceed 10 or 20 (i.e. SMRs 
ranging from 30 to 300, or 20 to 400 at the outside). Simulating with the known 
covariate x; and expectancies Æ; it is possible to obtain or ‘elicit’ priors consistent 
with these prior beliefs. For instance one might consider taking a N(0,1) prior on Bı 
and a N(0.5,1) prior on B». The latter favours positive values, but still has a large part of 
its density over negative values. 

Values of y; are simulated (see Model 2 in Program 1.3) with these priors; note that 
initial values are by definition generated from the priors, and since this is pure simula- 
tion there is no notion of convergence. Because relative risks tend to be skewed, the 
median relative risks (i.e. y;/E;) from a run of 1000 iterations are considered as 


1° The first is the City of London (1 ward), then wards are alphabetic within boroughs arranged alphabetically 
(Barking, Barnet, ....,Westminster). All wards have five near neighbours as defined by the nearest wards in 
terms of crow-fly distance. 
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summaries of contrasts between areas under the above priors. The extreme relative risks 
are found to be 0 and 6 (SMRs of 0 and 600) and the 2.5% and 97.5% percentiles of 
relative risk are 0.37 and 2.99. So this informative prior specification appears broadly in 
line with accumulated evidence. 

One might then see how far inference about 8, is affected by adopting the N(0.5, 1) 
prior instead of the N(0, 1000) diffuse prior? when the observations are restored. In 
fact, the 95% credible interval from a two chain run (with initial values as before and run 
length of 2500 iterations) is found to be the same as under the diffuse prior. 

A different example of sensitivity analysis involves using a contamination prior on B>. 
Thus, suppose 7(B,) is N(0.5, 1) as above, but that for 72(B,) a Student ¢ with 2 degrees 
of freedom but same mean zero and variance is adopted, and p = 0.1. Again, the same 
credible interval for B, is obtained as before (Model 3 in Program 1.3). One might take 
the contaminating prior to be completely flat (dflat( ) in BUGS), and this is suggested as 
an exercise. In the current example, inferences on B, appear robust here to alternative 
priors, and this is frequently the case with regression parameters in large samples — 
though with small datasets there may well be sensitivity. 

An example where sensitivity in inferences concerning random effects may occur is 
when the goal in a small area mortality analysis is not the analysis of regressor effects 
but the smoothing of unreliable rates based on small event counts or populations at risk 
(Manton et al., 1987). Such smoothing or ‘pooling strength’ uses random effects over a 
set of areas to smooth the rate for any one area towards the average implied under the 
density of the effects. Two types of random effect have been suggested, one known as 
unstructured or ‘white noise’ variation, whereby smoothing is towards a global average, 
and spatially structured variation whereby smoothing is towards the average in the 
‘neighbourhood’ of adjacent wards. Then the total area effect a; consists of an unstruc- 
tured or ‘pure heterogeneity’ effect v; and a spatial effect ;. While the data holds 
information about which type of effect is more predominant, the prior on the variances 
o? and o$ may also be important in identifying the relative roles of the two error 
components. 

A popular prior used for specifying spatial effects, the CAR(1) prior of Besag et al. 
(1991), introduces an extra identifiability issue in that specifies differences in risk 
between areas i and j, 6; — ;, but not the average level (i.e. the location) of the spatial 
risk (see Chapter 7). This prior can be specified in a conditional form, in which 


di~ NÒ ej, 04/Mi) 
JEA: 
where M; is the number of areas adjacent to area i, and j € A; denotes that set of areas. 
To resolve the identifiability problem one may centre the sampled b; at each MCMC 
iteration and so provide a location, i.e. actually use in the model to predict log (p;) the 


shifted effects $; = p; — ġ. In fact, following Sun et al. (1999), identifiability can also be 
gained by introducing a correlation parameter y 


di~ NOJ oj, 04 /Mi) (1.8) 


jedi 


which is here taken to have prior y ~ U(0, 1). Issues still remain in specifying priors on 
o? and o% (or their inverses) and in identifying both these variances and the separate 
risks v; and @; in each area in the model 


20 The prior on the intercept is changed to N(0, 1) also. 
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log(p) =B v >; 


where the prior for Q; is taken to be as in Equation (1.8) and where v; ~ N(0, o2). A 
‘diffuse prior’ strategy might be to adopt gamma priors G(a1, a2) on the precisions 1/02 
and 1/ ore where a; = a = a and a is a small constant such as a = 0.001, but possible 
problems in doing this are noted above. One might, however, set priors on a, and a 
themselves rather than presetting them (Daniels and Kass, 1999), somewhat analogous 
to contamination priors in allowing for higher level uncertainty. 

Identifiability might also be improved by instead linking the specification of v; and 6; 
in some way (see Model 4 in Program 1.3). For example, one might adopt a bivariate 
prior on these random effects as in Langford et al. (1998) and discussed in Chapter 7. Or 
one might still keep v; and ; as univariate errors, but recognise that the variances are 
interdependent, for instance taking o? = cog so that one variance is conditional on the 
other and a pre-selected value of c. Bernardinelli et al. (1995) recommend c = 0.7. A 
prior on c might also be used, e.g. a gamma prior with mean 0.7. One might alternatively 
take a bivariate prior (e.g. bivariate Normal) on log (02) and log (04). Daniels (1999) 
suggests uniform priors of the ratio of one variance to the sum of the variances, for 
instance a U(0, 1) prior on o2/[o} 0%], though the usual application of this approach 
is to other forms of hierarchical model. 

Here we first consider independent G(0.5, 0.0005) priors on 1/0? and 1/ oS in a two 
chain run. One set of initial values is provided by ‘default’ values, and the other by 
setting the model’s central parameters to their mean values under an initial single chain 
run. The problems possible with independent diffuse priors show in the relatively slow 
convergence of og; not until 4500 iterations does the Gelman-Rubin statistic fall below 
1.1. As an example of inferences on relative mortality risks, the posterior mean for the 
first area, where there are three deaths and 2.7 expected (a crude relative risk of 1.11), is 
1.28, with 95% interval from 0.96 to 1.71. The risk for this area is smoothed upwards to 
the average of its five neighbours, all of which have relatively high mortality. This 
estimate is obtained from iterations 4500-9000 of the two chain run. The standard 
deviations o, and o4 of the random effects have posterior medians 0.041 and 0.24. 

In a second analysis the variances”! are interrelated with o2 = co}, and c taken as 
G(0.7, 1). This is relatively informative prior structure, and reflects the expectation that 
any small area health outcome will probably show both types of variability. Further, the 
prior on 1 /0% allows for uncertainty in the parameters, i.e. instead of a default prior 
such as 1/04} ~G(1, 0.001), it is assumed that 


1/0}, ~ Gla, a) 
with 
a, ~ Exp(1) 
ay ~ G(1, 0.001) 


The priors for a; and a reflect the option sometimes used for a diffuse prior on 
precisions such as 1 [53> namely a G(1, v) prior on precisions (with v preset at a small 
constant, such as v = 0.001). 

This model achieves convergence in a two chain run of 10000 iterations at around 
3000 iterations, and yields a median for c of 0.12, and for o, and ow of 0.084 and 0.24. 
The posterior medians of a, and a, are 1.7 and 0.14. Despite the greater element of pure 


?! In BUGS this inter-relationship involves precisions. 
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heterogeneity the inference on the first relative risk is little affected, with mean 1.27 and 
95% credible interval (0.91, 1.72). 

So some sensitivity is apparent regarding variances of random effects in this example 
despite the relatively large sample, though substantive inferences may be more robust. A 
suggested exercise is to experiment with other priors allowing interdependent variances 
or errors, e.g. a U(0, 1) prior on of/[o} 0%]. A further exercise might involve 
summarising sensitivity on the inferences about relative risk, e.g. how many of the 758 
mean relative risks shift upward or downward by more than 2.5%, and how many by 
more than 5%, in moving from one random effects prior to another. 


Example 1.4 Gessel score To illustrate possible outlier analysis, we follow Pettitt and 
Smith (1985) and Weiss and Cho (1998), and consider data for n = 21 children on 
Gessel adaptive score (y) in relation to age at first word (x in months). Adopting a 
Normal errors model with parameters B = (B1, B2), estimates of the CPO; may be 
obtained by single case omission, but an approximation based on a single posterior 
sample avoids this. Thus for T samples (Weiss, 1994), 


CPO AT TS 1FO18% SP’ 


t=1 


or the harmonic mean of the likelihoods of case i. Here an initial run is used to estimate 
the CPOs in this way, and a subsequent run produces influence diagnostics, as in Weiss 
and Cho (1998). It is apparent (Table 1.1) that child 19 is both a possible outlier and 
influential on the model parameters, but child 18 is influential without being an outlier. 


Table 1.1 Diagnostics for Gessel score 


Child CPO Influence (Kullback K1) Influence (L1 norm) Influence (chi square) 


1 0.035 0.014 0.066 0.029 
2 0.021 0.093 0.161 0.249 
3 0.011 0.119 0.182 0.341 
4 0.025 0.032 0.096 0.071 
5 0.024 0.022 0.081 0.047 
6 0.035 0.015 0.068 0.031 
7 0.034 0.015 0.068 0.031 
8 0.035 0.014 0.067 0.029 
9 0.034 0.016 0.071 0.035 
10 0.029 0.023 0.083 0.051 
11 0.019 0.067 0.137 0.166 
12 0.033 0.017 0.072 0.035 
13 0.011 0.119 0.182 0.341 
14 0.015 0.066 0.137 0.163 
15 0.032 0.015 0.068 0.032 
16 0.035 0.014 0.067 0.030 
17 0.025 0.022 0.081 0.047 
18 0.015 1.052 0.387 75.0 
19 0.000138 2.025 0.641 56.2 
20 0.019 0.042 0.111 0.098 
21 0.035 0.014 0.067 0.030 
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As Pettitt and Smith (1985) note, this is because child 18 is outlying in the covariate 
space, with age at first word (x) much later than other children, whereas child 19 is 
outlying in the response (y) space. 


1.6 REVIEW 


The above worked examples are inevitably selective, but start to illustrate some of 
the potentials of Bayesian methods but also some of the pitfalls in terms of the 
need for ‘cautious inference’. The following chapters consider similar modelling 
questions to those introduced here, and include a range of worked examples. The 
extent of possible model checking in these examples is effectively unlimited, and a 
Bayesian approach raises additional questions such as sensitivity of inferences to 
assumed priors. 

The development in each chapter draws on contemporary discussion in the statistical 
literature, and is not confined to reviewing Bayesian work. However, the worked 
examples seek to illustrate Bayesian modelling procedures, and to avoid unduly lengthy 
discussion of each, the treatments will leave scope for further analysis by the reader 
employing different likelihoods, prior assumptions, initial values, etc. 

Chapter 2 considers the potential for pooling information across similar units 
(hospitals, geographic areas, etc.) to make more precise statements about parameters 
in each unit. This is sometimes known as ‘hierarchical modelling’, because higher 
level priors are specified on the parameters of the population of units. Chapter 3 
considers model choice and checking in linear and general linear regressions. Chapter 
4 extends regression to clustered data, where regression parameters may vary randomly 
over the classifiers (e.g. schools) by which the lowest observation level (e.g. pupils) are 
classified. 

Chapters 5 and 6 consider time series and panel models, respectively. Bayesian 
specifications may be relevant to assessing some of the standard assumptions 
of time series models (e.g. stationarity in ARIMA models), give a Bayesian interpret- 
ation to models commonly fitted by maximum likelihood such as the basic structural 
model of Harvey (1989), and facilitate analysis in more complex problems, for example, 
shifts in means and/or variances of series. Chapter 6 considers Bayesian treatments of 
the growth curve model for continuous outcomes, as well as models for longitudinal 
discrete outcomes, and panel data subject to attrition. Chapter 7 considers observations 
correlated over space rather than through time, and models for discrete 
and continuous outcomes, including instances where regression effects may 
vary through space, and where spatially correlated outcomes are considered through 
time. 

An alternative to expressing correlation through multivariate models is to introduce 
latent traits or classes to model the interdependence. Chapter 8 considers a variety of 
what may be termed structural equation models, the unity of which with the main body 
of statistical models is now being recognised (Bollen, 2001). 

The final two chapters consider techniques frequently applied in biostatistics and 
epidemiology, but certainly not limited to those application areas. Chapter 9 considers 
Bayesian perspectives on survival analysis and chapter 10 considers ways of using data 
to develop support for causal mechanisms, as in meta-analysis and dose-response 
modelling. 
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CHAPTER 2 


Hierarchical Mixture Models 


2.1 INTRODUCTION: SMOOTHING TO THE POPULATION 


A relatively simple Bayesian problem, but one which has motivated much research, is 
that of ensemble estimation, namely estimating the parameters of a common distribu- 
tion thought to underlay a collection of outcomes for similar types of units. Among 
possible examples are medical, sports, or educational: death rates for geographical 
areas, batting averages for baseball players, Caesarian rates in maternity units, and 
exam success rates for schools. Given the parameters of the common density, one seeks 
to make conditional estimates of the true outcome rate in each unit of observation. 
Because of this conditioning on the higher stage densities, such estimation for sets of 
similar units is also known as ‘hierarchical modelling’ (Kass and Steffey, 1989; Lee, 
1997, Chapter 8). For instance, in the first stage of the Poisson-gamma model con- 
sidered below, the observed counts are conditionally independent given the unknown 
means that are taken to have generated them. At the second stage, these means are 
themselves determined by the gamma density parameters, while the density for the 
gamma parameters forms the third stage. 

These procedures, whether from a full or empirical Bayes perspective, usually result in 
a smoothing of estimates for each unit towards the average outcome rate, and have 
generally been shown to have greater precision and better out of sample predictive 
performance. Specifically, Rao (1975) shows that with respect to a quadratic loss 
function, empirical Bayes estimators outperform classical estimators in problems of 
simultaneous inference regarding a set of related parameters. These procedures may, 
however, imply a risk of bias as against unadjusted maximum likelihood estimates — this 
dilemma is known as the bias-variance trade-off. 

Such procedures for ‘pooling strength’ rest on implicit assumptions: that the units are 
exchangeable (similar enough to justify an assumption of a common density), and that 
the smoothing model chosen is an appropriate one. It may be that units are better 
considered exchangeable within sub-groups of the data (e.g. outcomes for randomised 
trials in one sub-group vs. outcomes for observational studies in another). Model choice 
is an additional uncertainty (e.g. does one take parametric or non-parametric approach 
to smoothing, and if a non-parametric discrete mixture, how many components?). 

Therefore, this chapter includes some guidelines as to model comparison and choice, 
which will be applicable to this and later chapters. There are no set “gold standard’ 
model choice criteria, though some arguably come closer to embodying true Bayesian 
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principles than others. Often one may compare ‘classical’ fit measures such as deviance 
or the Akaike Information Criterion (Bozdogan, 2000), either averages over an MCMC 
chain (e.g. averages of deviances D attaching to parameters 6 at each iteration), or in 
terms of the deviance at the posterior mean. These lead to a preliminary sifting of 
models and more comprehensive model assessment, and selection is reserved to a final 
stage of the analysis involving a few closely competing models. 


2.2 GENERAL ISSUES OF MODEL ASSESSMENT: MARGINAL 
LIKELIHOOD AND OTHER APPROACHES 


There is usually uncertainty about appropriate error structures and predictor variables 
to include in models. Adding more parameters may improve fit, but maybe at the 
expense of identifiability and generalisability. Model selection criteria assess whether 
improvements in fit measures such as likelihoods, deviances or error sum of squares 
justify the inclusion of extra parameters in a model. Classical and Bayesian model 
choice methods may both involve comparison either of measures of fit to the current 
data or cross validatory fit to out of sample data. For example, the deviance statistics of 
general linear models (with Poisson, normal, binomial or other exponential family 
outcomes) follow standard densities for comparisons of models nested within one 
another, at least approximately in large samples (McCullagh and Nelder, 1989). Penal- 
ised measures of fit (Bozdogan, 2000; Aikake, 1973) may be used, involving an adjust- 
ment to the model log-likelihood or deviance to reflect the number of parameters in the 
model. 

Thus, suppose L denotes the likelihood and D the deviance of a model involving p 
parameters. The deviance may be simply defined as minus twice the log likelihood, 
D = —2 log L, or as a scaled deviance: 


D' = —2 log (L/ Ls), 


where L, is the saturated likelihood obtained by an exact fit of predicted to observed 
data. Then to allow for the number of parameters (or ‘dimension’ of the model), one 
may use criteria such as the Akaike Information Criterion (or AIC), expressed either as! 


D 2p 
or 
D 2p 
So when the AIC is used to compare models, an increase in likelihood and reduction 
in deviance is offset by a greater penalty for more complex models. 
Another criterion used generally as a penalised fit measure, though also justified as an 
asymptotic approximation to the Bayesian posterior probability of a model, is the 
Schwarz Information Criterion (Schwarz, 1978). This is also often called the Bayes 


Information Criterion. Depending on the simplifying assumptions made, it may take 
different forms, but the most common version is, for sample of size n, 


BIC=D plog, (n) 


' So a model is selected if it has lowest AIC. Sometimes the AIC is obtained as L — p with model selection 
based on maximising the AIC. 
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Under this criterion models with lower BIC are chosen, and larger models (with more 
parameters) are more heavily penalised than under the AIC. The BIC approximation 
for model j is derived by considering the posterior probability for the model M; as in 
Equation (2.1) below, and by expanding minus twice the log of that quantity around the 
maximum likelihood estimate (or maybe some other central estimate). 

In Bayesian modelling, prior information is introduced on the parameters, and the fit 
of the model to the data at hand and the resulting posterior parameter estimates are 
constrained to some degree by adherence also to this prior ‘data’. One option is to 
simply compare averages of standard fit measures such as the deviance or BIC over an 
MCMC run, e.g. consider model choice in terms of a model which has minimum 
average AIC or BIC. Approaches similar in some ways to classical model validation 
procedures are often required because the canonical Bayesian model choice methods 
(via Bayes factors) are infeasible or difficult to apply in complex models or large samples 
(Gelfand and Ghosh, 1998; Carlin and Louis, 2000, p. 220). The Bayes factor may be 
sensitive to the information contained in diffuse priors, and is not defined for improper 
priors. 

Monitoring fit measures such as the deviance over an MCMC run has utility if 
one seeks penalised fit measures taking account of model dimension. A complication 
is that the number of parameters in complex random effects models is not actually 
defined. Here work by Spiegelhalter et al. (2002) may be used to estimate the effective 
number of parameters, denoted pe. Specifically, for data y and parameters 0, pe is 
approximated by the difference between the expected deviance E(D|y, 0), as measured 
by the posterior mean of sampled deviances D® = D(@) at iterations t = 1, .., T ina 
long MCMC run, and the deviance D(6|y), evaluated at the posterior mean 6 of the 
parameters. Then one may define a penalised fit measure analogous to the Akaike 
information criterion as 


D(@ly) 2Pe 


and this has been termed the Deviance Information Criterion. Alternatively a modified 
Bayesian Information Criterion BIC = D(@| y) pe log (n) may be used, as this takes 
account of both sample size and complexity (Upton, 1991; Raftery, 1995). Note that pe 
might also be obtained by comparing an average likelihood with the likelihood at the 
posterior mean and then multiplying by 2. Related work on effective parameters when 
the average likelihoods of two models are compared appears in Aitkin (1991). 

The Bayesian approach to model choice and its implementation via MCMC sampling 
methods has benefits in comparisons of non-nested models — for instance, in comparing 
two nonlinear regressions or comparing a beta-binomial model as against a discrete 
mixture of binomials (Morgan, 2000). A well known problem in classical statistics is 
in likelihood comparisons of discrete mixture models involving different numbers of 
components, and here the process involved in Bayesian model choice is simpler. 


2.2.1 Bayes model selection using marginal likelihoods 


The formal Bayesian model assessment scheme involves marginal likelihoods, and while 
it follows a theoretically clear procedure may in practice be difficult to implement. 
Suppose K models, denoted Mp, k =1,..K, have prior probabilities b, = P(M;,) 
assigned to them of being true, with X-1, xb, = 1. Let 0, be the parameter set in 
model k, with prior 7(6;). Then the posterior probabilities attaching to each model after 
observing data y are 
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K 
PM») = POMY [FOT ODdo/ $ P) [omod eD 
j=l 
where f(y|6;) = L(8,|y) is the likelihood of the data under model k. The integrals in 
both the denominator and numerator of Equation (2.1) are known as prior predictive 
densities or marginal likelihoods (Gelfand and Dey, 1994). They give the probability of 
the data conditional on a model as 


PO == [oora (2.2) 


The marginal density also occurs in Bayes Formula for updating the parameters 0; of 
model k, namely 


TOY) =S OlT) my) (2.3) 


where 77(8;|y) denotes the posterior density of the parameters. This is also expressible as 
the ‘marginal likelihood identity’ (Chib, 1995; Besag, 1989): 


mY) =f (y|9x)7(8;)/7(Gx|y) (2.4) 


Model assessment can often be reduced to a sequential set of choices between two 
competing models — though an increased emphasis is now being placed on averaging 
inferences over models. It is in such comparisons that marginal likelihoods play a role. 
The formal method for comparing two competing models in a Bayesian framework 
involves deriving posterior odds after estimating the models separately. For equal prior 
odds on two models M, and M3, with parameters 0; and 02 of dimension pı and pp, this 
is equivalent to examining the Bayes factor on model 2 versus model 1. The Bayes factor 
is obtained as the ratio of marginal likelihoods mı( y) and m( y), such that 


PMily) _ PQ|M:) PON) 
P(M2|y)  PO|M) PM) 
PosteriorOdds Bayesfactor PriorOdds 
(= [m (y)/m) [1 /62]) 


The integral in Equation (2.2) can in principle be evaluated by sampling from the prior 
and calculating the resulting likelihood, and is sometimes available analytically. How- 
ever, more complex methods are usually needed, and in highly parameterised or non- 
conjugate models a fully satisfactory procedure has yet to be developed. Several ap- 
proximations have been suggested, some of which are described below. Another issue 
concerns Bayes factor stability when flat or just proper non-informative priors are used 
on parameters. It can be demonstrated that such priors lead (when models are nested 
within each other) to simple models being preferred over more complex models — this is 
Lindley’s paradox (Lindley, 1957), with more recent discussions in Gelfand and Dey 
(1994) and DeSantis and Spezzaferri (1997). By contrast, likelihood ratios used in 
classical testing tend to favour more complex models by default (Gelfand and Dey, 
1994). Even under proper priors, with sufficiently large sample sizes the Bayes factor 
tends to attach too little weight to the correct model and too much to a less complex or 
null model. Hence, some advocate a less formal view to Bayesian model selection based 
on predictive criteria other than the Bayes factor (see Section 2.2.4). These may lead to 
model checks analogous to classical p tests or to pseudo-Bayes factors of various kinds. 


(2.5) 
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2.2.2 Obtaining marginal likelihoods in practice 


MCMC simulation methods are typically applied to deriving posterior densities f(6|y) 
or sampling predictions new in models considered singly. However, they have extended 
to include parameter estimation and model choice in the joint parameter and model 
space {0,, Mx} fork = 1,..,K (Carlin and Chib, 1995). Thus, at iteration ¢ there might 
be a switch between models (e.g. from M; to Mp) and updating only on the parameters 
in model k. For equal prior model probabilities, the best model is the one chosen most 
frequently, and the posterior odds follow from Equation (2.5). The reversible jump 
algorithm of Green (1995) also provides a joint space estimation method. 

However, following a number of studies such as Chib (1995), Lenk and Desarbo 
(2000) and Gelfand and Dey (1994), the marginal likelihood of a single model may be 
approximated from the output of MCMC chains. The most simple apparent estimator 
of the marginal likelihood would apply the usual Monte Carlo methods for estimating 
integrals in Equation (2.2). Thus for each of a large number of draws, t = 1,.., T from 
the prior density of @, one may evaluate the likelihood L© = L(6|y) at each draw, and 
calculate the average. Subject to possible numerical problems, this may be feasible with 
a moderately informative prior, but would require a considerable number of draws (T 
perhaps in the millions). 

Since Equation (2.4) is true for any point, this suggests another estimator for m( y) 
based on an approximation for the posterior density 7(6|y), perhaps at a high density 
point such as the mean 6. So taking logs throughout, 


log (m(y)) = log(f(y|®) log m8) — log #(6|y) (2.6) 


Alternatively, following DiCiccio et al. (1997), Gelfand and Dey (1994, p. 511), and 
others, importance sampling may be used. In general, the integral of a function h(u) may 
be written as 


H = | nandu = | th /g0) gdu 
where g(u) is the importance function. Suppose u™, uw, ..,u“ are a series of draws 


from this function g which approximates A, whereas A itself which is difficult to sample 
from. An estimate of H is then 


T 
TYO aw)/gu) 
=l 
As a particular example, the marginal likelihood might be expressed as 


m(y) = [row x(0) do = [roo] g(0)d0 


where g is a normalised importance function for f ( y|6)7(6). The sampling estimate of is 
then 


T 
iy) =T X L0)n(0)/g(0) 
t=1 


where 0), 0®,.., 0 are draws from the importance function g. In practice, only an 
unnormalised density g* may be known, and the normalisation constant is estimated as 
TEL, 7(0)/g*(0), with corresponding sampling estimate 
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T T 
my) = X L0)w(0) Sw) (2.7) 
t=1 t=1 


where w(0) = 7(0)/g*(0). Following Geweke (1989), it is desirable that the tails 
of the importance function g decay slower than those of the posterior density that 
the importance function is approximating. So if the posterior density is multivariate 
Normal (for analytic reasons or by inspection of MCMC samples), then a multivari- 
ate Student ¢ with low degrees of freedom is most appropriate as an importance 
density. 

A special case occurs if g* = Lr, leading to cancellation in Equation (2.7) and to the 
harmonic mean of the likelihoods as an estimator for m( y), namely 


my) = T/TS > 1/23] (2.8) 


For small samples this estimator may, however, be subject to instability (Chib, 
1995). For an illustration of this criterion in disease mapping, see Hsiao et al. 
(2000). 

Another estimator for the marginal likelihood based on importance sampling ideas is 
obtainable from the relation” 


g(8) 


—> nð dð 
L@lyyn(o) EP 


pmo) = | 
so that 


n= | robo 
= 1/Elg(©)/{L(0|y) =(0)}] 


where the latter expectation is with respect to the posterior distribution of 8. 
The marginal likelihood may then be approximated by 


MOJ =A S 2° n® = T/L Se? ALO a} (2.9) 


Evidence on the best form of g() to use in Equation (2.9) is still under debate, but it is 
generally recommended to be a function (or product of separate functions) that ap- 
proximates 7(6|y). So in fact two phases of sampling are typically involved: an initial 
MCMC analysis to provide approximations g to f(6|y) or its components; and a second 
run recording g®, LO and wv” at iterations t=1,..,7, namely the values of the 
importance density, the likelihood and the prior as evaluated at the sampled values 
0, which are either from the posterior (after convergence), or from g itself. The 
importance density and prior value calculations, g and 7, may well involve a product 
over relevant components for individual parameters. 


? For a normalised density 1 = f g(6)d0 = f 2(0)[m(y)7(|y)/{L(6|y)7(0)}]d0, where the term enclosed in 
square brackets follows from Equation (2.3). 
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For numeric reasons (i.e. underflow of likelihoods L® in larger samples), it may be 
more feasible to obtain estimates of log [m(y)] in Equation (2.9), and then take expo- 
nentials to provide a Bayes factor. This involves monitoring 


8 = log[g?/{LOn] = log(g) — [log(L®)  log(n)] 
for T iterations. Then a spreadsheet? might be used to obtain 
AO = exp [8] (2.10) 


and then minus the log of the average of the A” calculated, so that 


log [i( y)] = — log (A) 


If exponentiation in Equation (2.10) leads to numeric overflow, a suitable constant 
(such as the average of the 8 can be subtracted from the 3” before they are exponen- 
tiated, and then also subtracted from — log (A). 


2.2.3 Approximating the posterior 


In Equations (2.6) and (2.9), an estimate of the marginal likelihood involves a function g 
that approximates the posterior 7(6|yv) using MCMC output. One possible approxima- 
tion entails taking moment estimates of the joint posterior density of all parameters, or a 
product of moment estimate approximations of posterior densities of individual param- 
eters or subsets of parameters. Suppose 0 is of dimension q and the sample size is n. 
Then, as Gelfand and Dey (1994) state, a possible choice for g to approximate the 
posterior would be a multivariate normal or Student ¢ with mean of length g and 
covariance matrices of dimension q xq that are computed from the sampled 
a, t=1,.., 7; 7=1,..q. The formal basis for this assumption of multivariate nor- 
mality of the posterior density, possibly after selective parameter transformation, rests 
with the Bayesian version of the central limit theorem (Kim and Ibrahim, 2000). 

In practice, for complex models with large numbers of parameters, one might split the 
parameters into sets (Lenk and Desarbo, 2000), such as regression parameters, vari- 
ances, dispersion matrices, mixture proportions, and so on. Suppose the first subset of 
parameters in a particular problem consists of regression parameters with sampled 
values BP ,t=1,..,7;j7 =1,..,q1. For these the posterior density might be approxi- 
mated by taking g(ß) to be multivariate normal or multivariate ¢, with the mean and 
dispersion matrices defined by the posterior means and the qı x qı dispersion matrix 
taken from a long MCMC run of T iterations on the qı parameters. Geweke (1989) 
considers more refined methods such as split Normal or ¢ densities for approximating 
skew posterior densities, as might occur in nonlinear regression. 

The next set, indexed j= qı 1,...,q2, might be the parameters of a precision 
matrix 


T= D 


for interdependent errors. For a precision matrix T of order r = q2 — qı, with Wishart 
prior W(Qo, ro), the importance density g(T) may be provided by a Wishart withn ro 
degrees of freedom and scale matrix Q = S(n ro), where S is the posterior mean of 
T~!. The set indexed byj = q2 1,..,q3 might be variance parameters Q; for independ- 


3 A spreadsheet is most suitable for very large or small numbers that often occur in this type of calculation. 
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ent errors. Since variances themselves are often skewed, the posterior of x; = log (¢;) 
may better approximate normality. The parameters indexed j = q3 1,..,q4 might be 
components = (1, W>,.., Wy) of a Dirichlet density* of dimension J = q4 — q3. Sup- 
pose J = 2, as in Example 2.2 below, then there is one free parameter to consider with 
prior beta density. If the posterior mean and variance of s from a long MCMC run are 
ky and V,, then these may be equated to the theoretical mean and variance, as in 
My = 4,/H and Vy = a,b,/H?[H 1], where H = (ap bp). Solving gives an approxi- 
mation to the posterior density of as a beta density with sample size 


H = [kyQ — ky) — Vyl/ Vy 


and success probability ky. 

So for the MCMC samples 0 = {8%, TO, x, w,..}, the values taken by the 
approximate posterior densities, namely g(8), g?(T), g?Qb) and g(x) and other 
stochastic quantities, are evaluated. Let the values taken by the product of these 
densities be denoted g. This provides the values of each parameter sample in the 
approximation to the posterior density 7(6|y) (Lenk and Desarbo, 2000, p. 117), and 
these are used to make the estimate m(y) in Equations (2.6) or (2.9). An example of how 
one might obtain the components of g using this approach, a beta-binomial mixture is 
considered in Example 2.2. 

Chib (1995) proposes a method for approximating the posterior in analyses when 
integrating constants of all full conditional densities are known as they are in standard 
conjugate models. Suppose the parameters fall into B blocks (eg. B=2 in 
linear univariate regression, with one block being regression parameters and the other 
being the variance). Consider the posterior density as a series of conditional densities, 
with 


a7(8|y) = T(ði ly) (83/01, y) 7(83|61, 02, y) MEET Tn(0g|ðg-1, 0p->, eee 01, y) 


In particular, 


m(6*|y) = 7(01*|y) 7(62*|01*, y) 7(83*|01*, 02%, y)...-. 


(2.11) 
7(Op*|Op-1*, Op-2*, ...01*, y) 


where 6* is a high density point, such as the posterior mean 6, where the posterior 
density in the marginal likelihood identity (2.4) may be estimated. 

Suppose a first run is used to provide 6*. Then the value of the first of these densities, 
namely 7(6;*|y) is analytically 


(01 *|y) = | 7(81*|y, 02, 03, ..-0p) a(82, 03, ia On|) d®, d03, .. d0p 


and may be estimated in a subsequent MCMC run with all parameters free. If this run is 
of length T, then the average of the full conditional density of 6; evaluated at the 
samples of the other parameters provides 


4 In a model involving a discrete mixture with J classes, define membership indicators G; falling into one of J 
possible categories, so that G; = j if individual subject i is assigned to class j. The assignment will be 
determined by a latent class probability vector b = (41, W2, . -ẹpy), usually taken to have a Dirichlet prior. 
The MCMC estimates E(;) and var(;) then provide moment estimates of the total sample size v in the 
posterior Dirichlet and the posterior ‘sample’ sizes v; of each component. v is estimated as 
1-5; E — yy var(ih;)]/ >; var(,), and v; as E(i;)v. More (less) precise estimates of i; imply a better 
(worse) identified discrete mixture and hence a higher (lower) posterior total ‘sample’ size v in the Dirichlet. 
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Oy = T YO (017/02, 09, ...097) 
t 


However, the second density on the right side of (2.11) conditions on 6, fixed at 0,*, 
and requires a secondary run in which only parameters in the B — 1 blocks apart from 6, 
are free to vary (0; is fixed at 6,;* and is not updated). The value of the full conditional 
a(82*|y, 01, 03, .. 0g) is taken at that fixed value of 6), but at the sampled values of 
other parameters, o0, k > 2, i.e. m(82*|y, 01*, 09, ... 00). So 


&(02*l01*, y) = T7! XO m(O2*|01*, 09, 09, ... 09) 
t 


In the third density on the right-hand side of (2.11), both 6; and 0) are known and 
another secondary run is required where all parameter blocks except 6; and 62 vary 
freely, and so on. One may then substitute the logs of the likelihood, prior and estimated 
posterior at 0* in Equation (2.6). Chib (1995) considers the case where latent data z are 
also part of the model, as with latent Normal outcomes in a probit regression; see 
Example 3.1 for a worked illustration. 


2.2.4 Predictive criteria for model checking and selection 


Another approach to model choice and checking is based on the principle of predictive 
cross-validation. In Bayesian applications, this may take several forms, and may lead to 
alternative pseudo Bayes factor measures of model choice. 

Thus, predictions might be made by sampling ‘new data’ from model means for case i 
at each iteration t in an MCMC chain. The sampled replicates Zo for each observation 
are then compared with the observed data, y;. For a normal model with mean pi? for 
case i at iteration ¢, and variance V®, such a sample would be obtained by taking the 
simulations 


2) ~ N(u), vo) 


Such sampling is the basis of the expected predictive approaches of Carlin and Louis 
(2000), Chen et al. (2000) and Laud and Ibrahim (1995). 

Predictions of a subset y, of the data may also be made from a posterior updated only 
using the complement of y,, denoted ypj; see also Section 2.2.5. A common choice 
involves jack-knife type cross-validation, where one case (say case i) is omitted at a 
time, with estimation of the model based only on jj, namely the remaining n — 1 cases 
excluding y;. Under this approach an important feature is that even if the prior 7, and 
hence possibly 7(6|y) is improper, the predictive density 


Pry = my) / mM) = f (8, ypp™(8|yp)d0 


is proper because the posterior based on using only yj in estimating 0, namely 7(6|y;,7), 
is proper. Geisser and Eddy (1979) suggest the product 


my) = | [poia (2.12) 
i=l 


of the predictive densities derived by omitting one case at a time (known as Conditional 
Predictive Ordinates, CPOs) as an estimate for the overall marginal likelihood. The 
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ratio of two such quantities under models Mı and M> provides a pseudo Bayes Factor 
(sometimes abbreviated as PsBF): 


PsBF = [| (p0ilya. Md)/pOilya. Md 
i=1 
Another estimator of the marginal likelihood extends the harmonic mean principle to 
the likelihoods of individual cases: thus the inverse likelihoods for each subject are 
monitored, and their posterior averages obtained from an MCMC run. Then the 
product over subjects of the inverses of these posterior averages, which (see Chapter 
1) are estimates of the conditional predictive ordinates for case i, produces another 
estimator of m( y). The latter may be called the CPO harmonic mean estimator 


my) = J | Po (2.13a) 
where 
1 T 1 j 
pilya) = |T a (2.13b) 
P | z LO J) 


A method supplying an Intrinsic Bayes factor is proposed by Berger and Perrichi (1996), 
and involves defining a small subset of the observed data, yr as a training sample. For 
instance, with a logit regression with p predictors, these samples are of size p 1. The 
posterior for 6 derived from such a training sample supplies a proper prior for analysing 
the remaining data yr). The canonical form of this method stipulates completely flat 
priors for @ in the analysis on the training samples, but one might envisage just proper 
priors being updated by training samples to provide more useful priors for the data 
remainders y;7}. In practice, we may need a large number of training samples, since for 
large sample sizes there are many such possible subsets. 


2.2.5 Replicate sampling 


Predictive checks based on replicate sampling — without omitting cases — are discussed in 
Laud and Ibrahim (1995). They argue that model selection criteria such as the Akaike 
Information Criterion and Bayes Information Criterion rely on asymptotic consider- 
ations, whereas the predictive density for a hypothetical replication Z of the trial or 
observation process leads to a criterion free of asymptotic definitions. As they say, ‘the 
replicate experiment is an imaginary device that puts the predictive density to inferential 
use’. 

For a given model k from K possible models, with associated parameter set 0,, the 
predictive density is 


P(Z|y) = oz 10x )7(0x|v)d 0x 
Laud and Ibrahim consider the measure 
C=) > [{E(Z)-yi}?  var(Z)] (2.142) 
i=l 


involving the match of predictions (replications) to actual data, E(Z;) — yi, and the 
variability, var(Z) of the predictions. Better models will have smaller values of C? or its 
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square root, C. In fact, Laud and Ibrahim define a ‘calibration number’ for model k as 
the standard deviation of C, and base model choice on them. If different models k and m 
provide predictive replicates Zi, and Zim, one might consider other forms of distance or 
separation measure between them, such as Kullback—Leibler divergence. 

Gelfand and Ghosh (1998) generalise this procedure to a deviance form appropriate 
to discrete outcomes, and allow for various weights on the matching component 
E” {E(Z;) — yi}. Thus, for continuous data and for any w > 0, 


C= Svarz) [w/w D] SEZ) — yi? (2.14b) 
i=] i=l 


This criterion may also be used for discrete data, possibly with transformation of both y; 
and z; (Chen and Ibrahim, 2000). 

Typical values of w at which to compare models might be w= 1, w= 10 and 
w = 100000. Larger values of w put more stress on the match between v; and y;, and 
so downweight precision of predictions. 

Gelman et al. (1995) provide an outline of another posterior predictive checking 
(rather than model choice) procedure. Suppose the actual data is denoted yop; and 
that D(yobs;8) is the observed criterion (e.g. a chi-square statistic); similarly, let the 
replicate data and the criterion based on them be denoted ynew and D(Ynew;0). Then a 
reference distribution Pr for the chosen criterion can be obtained from the joint 
distribution of new and 0, namely 


PR(news 0) = P(Ynewl0) T(O|Vobs) 


and the actual value set against this reference distribution. Thus a tail probability, 
analogous to a classical significance test, is obtained as 


Po(Vovs) = PRIDWnew;8) > D(Vobs;9)| Vos] (2.15) 
In practice, D(y®,, 0) and D(yors, 0) are obtained at each iteration in an MCMC 


new? 
run, and the proportion of iterations where D(y2,,, 6) exceeds D(yovs, 0) calculated 
(see Example 2.2). Values near 0 or 1 indicate lack of fit, while mid-range values 
(between 0.2 and 0.8) indicate a satisfactory model. A predictive check procedure is 
also described by Gelfand (1996, p. 153), and involves obtaining 50%, 95% (etc.) 
intervals of the Ynew, ; and then counting how many of the actual data points are located 


in these intervals. 


2.3 ENSEMBLE ESTIMATES: POOLING OVER SIMILAR UNITS 


We now return to the modelling theme of this chapter, in terms of models for smoothing 
a set of parameters for similar units or groups in a situation which does not involve 
regression for groups or members within groups. Much of the initial impetus to 
development of Bayesian and Empirical Bayesian methods came from this problem, 
namely simultaneous inference about a set of parameters for similar units of observation 
(schools, clinical trials, etc.) (Rao, 1975). We expect the outcomes (e.g. average exam 
grades, mortality rates) over similar units (schools, hospitals) to be related to each other 
and drawn from a common density. In some cases, the notion of exchangeability may be 
modified: we might consider hospital mortality rates to be exchangeable within one 
group of teaching hospitals and within another group of non-teaching hospitals, but not 
across all hospitals in both groups combined. Another example draws on recent experi- 
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ence in UK investigations into cardiac surgery deaths: the performance of 12 centres is 
more comparable within two broad operative procedure types, ‘closed’ procedures 
involving no use of heart bypass during anaesthesia, and ‘open’ procedures where the 
heart is stopped and heart bypass needed (Spiegelhalter, 1999) 

The data may take the form of aggregate observations y; from the units, e.g. means for 
a metric variable or numbers of successes for a binomial variable, or be disaggregated to 
observations y; for subjects į within each group or unit of observation j. The data are seen 
as generated by a compound or hierarchical process, where the parameter ); relevant to 
the jth unit is sampled from a prior density at stage 2, and then at stage 1 the observations 
are sampled from a conditional distribution given the unit parameters. 

A related theme but with a different emphasis has been in generalising the standard 
densities to allow for heterogeneity between sample units. Thus the standard densities 
(e.g. binomial, Poisson, normal) are modified to take account of heterogeneity in 
outcomes between units which is greater than postulated under that density. This 
heterogeneity is variously known as over-dispersion, extra-variation or (in the case of 
symmetric data on continuous scales) as heavy tailed data. Williams (1982) discusses the 
example of toxicological studies where proportions of induced abnormality between 
litters of experimental animals vary because of unknown genetic or environmental 
factors. Similarly in studies of illness, there is likely to be variation in frailty or 
proneness À. 

Under either perspective consider the first stage sampling density f(y|A), for a set of n 
observations, y;,i=1,..,, continuous or discrete, conditional on the parameter 
vector A = {Aj,...., An). Often a single population wide value of à (i.e. hj = A for all 
J) will be inappropriate, and we seek to model population heterogeneity. This typically 
involves either (a) distinct parameters \j,.., Nn for each subject i= 1,.., in the 
sample, or (b) parameters \,,.., ày constant within J sub-populations. The latter 
approach implies discrete mixtures (e.g. Richardson and Green, 1997; Stephens, 
2000), while the first approach most commonly involves a parametric model, drawing 
the random effects \; from a hyperdensity, with form 


A ~ TAO) 


In this density the 0 are sometimes called hyperparameters (i.e. parameters at the second 
or higher stages of the hierarchy, as distinct from the parameters of the first stage 
sampling density). They will be assigned their own prior 7(8), which may well (but not 
necessarily always) involve further unknowns. If there are no higher stages, the marginal 
density of y is then 


mo) = | [Omaloma (2.16) 


For example, consider a Poisson model y ~ Poi(A), where y is the number of nonfatal 
illnesses or accidents in a fixed period (e.g. a year), and \ is a measure of illness or 
accident proneness. Instead of assuming all individuals have the same proneness, we 
might well consider allowing à to vary over individuals according to a density m(A|6), 
for instance a gamma or log-normal density to reflect the positive skewness in prone- 
ness. Since à is necessarily positive, we then obtain the distribution of the number of 
illnesses or accidents (i.e. the marginal density as in Equation (2.16) above) as 


Pry = k) = [fo exp ( — A)/klln(Aà|0)n(0)dàdO 
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where the range of the integration over À is restricted to positive values, and that for 0 
depends upon the form of the parameters 0. In this case 


E(y) = EQ) 
and 
Var(y) = EN) Var) (2.17) 


so that Var(\) = 0 corresponds to the simple Poisson. It is apparent from Equation 
(2.17) that the mixed Poisson will always show greater variability than the 
simple Poisson. This formulation generalises to the Poisson process, where counts 
occur in a given time ¢ or over a given population exposure E. Thus, now y ~ Poi(A2) 
over time of observation period t, or y ~ Poi(AE), where y might be deaths in areas 
and E the populations living in them. The classic model for a mixed Poisson 
process (Newbold, 1926) assumes that à for a given individual is fixed over time, 
and that there is no contagion (i.e. influence of past illnesses or accidents on future 
occurrences). 

The model choice questions include assessing whether heterogeneity exists and if so, 
establishing the best approach to modelling it. Thus, under a discrete mixture approach, 
a major question is choosing the number of sub-populations, including whether one 
sub-population only (i.e. homogeneity) is the best option. Under a parametric approach 
we may test whether there is in fact heterogeneity, i.e. whether a model with var(\) 
exceeding zero improves on a model with constant over all subjects, and if so, what 
density might be adopted to describe it. 


2.3.1 Mixtures for Poisson and binomial data 


Consider, for example, the question of possible Poisson heterogeneity or extravariation 
in counts O; for units 7 with varying exposed to risk totals such that E; events are 
expected. An example of this is in small area mortality and disease studies, where O; 
deaths are observed as against E; deaths expected on the basis of the global death rate 
average or more complex methods of demographic standardisation. Then a homoge- 
neous model would assume 


O; ~ Poi(LE;) 
with Z a constant relative risk across all areas, while a heterogeneous model would take 

Oi ~ Poi(A;£;) 

hi ~ TAJO) 
with 7(A|9) a hyperdensity. For instance, if a gamma prior G(a,b) is adopted for the 
varying relative risks \;’s, then E(\) = a/b and var(\) = a/b? = E(N)/b. The third stage 
might then be specified as 

a~ E(1) 

b ~ G(1, 0.001) 
that is in terms of relatively flat prior densities consistent with a and b being positive 
parameters. 


Whatever mixing density is adopted for A, an empirical moment estimator (Bohning, 
2000) for 77 = var(A), is provided by 


44 HIERARCHICAL MIXTURE MODELS 
3 =1/nl)_ (0; — EL) /E7} -— LY > (1/E3] (2.18) 


and indeed, might be used in setting up the priors for a and b. 
Heterogeneity may also be modelled in a transform of \; such as log(A;). This 
transformation extends over the real line, so we might add a normal or student ¢ error u; 


log (Ai) =K Ui 


This approach is especially chosen when à is being modelled via a regression or in a 
multi-level situation, since one can include the fixed effects and several sources of extra- 
variability on the log scale (see Chapters 3 and 4). 

For binomial data, suppose the observations consist of counts y; where an event 
occurred in populations at risk n;, with 


yi ~ Bin(p;, ni) 


Rather than assume p; = p, suppose the parameters for groups or subjects i are drawn 
from a beta density 


pi ~ Beta(a, B) 


The hyperparameters {a, B} may themselves be assigned a prior, 7(a, B), at the second 
stage, though sometimes a and or B are assumed to be known. For instance, taking 
known hyperparameter values a = B = 1 is the same as taking the p;’s to be uniform 
over (0, 1). If a and B are taken to be unknowns, then the joint posterior density of 
{a, B, pi} is proportional to 


x(a, Ba DTO TT pa-p TT pha por 
=f i=1 


The full conditional density of the p; parameters can be seen from above to consist of 
beta densities with parametersa y;and B  n;— yi. 

An alternative approach to binomial heterogeneity is to include a random effect in the 
model for logit(p;). This is sometimes known as the logistic-normal mixture (Aitchison 
and Shen, 1980); see Example 2.3. 


Example 2.1 Hepatitis B in Berlin regions As an illustration of Poisson outcomes 
subject to possible overdispersion, consider the data presented by Bohning (2000) on 
observed and expected cases of Hepatitis B in 23 Berlin city regions, denoted 
{O;, Ei} i= 1, .., 23. Note that the standard is not internalř, and so X;E; = 361.2 
differs slightly from 11,0; = 368. We first test for heterogeneity by considering a single 
parameter model 


O; ~ Poi(LE;) 


and evaluating the resulting chi-square statistic, 


5 An internal standardisation to correct for the impact of age structure differences between areas (on an 
outcome such as deaths by area) produces expected deaths or incidence by using age-specific rates defined for 
the entire region under consideration (e.g. Carlin and Louis, 2000, p. 307). Hence, the standard mortality ratio 
or standard incidence ratio for the entire region would be 100. An external standard means using a national or 
some other reference set of age-specific rates to produce expected rates for the region and areas within it. 
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The overall mean relative risk in this case is expected to be approximately 368/361.2, 
and a posterior mean Ê= 1.019 is accordingly obtained. The chi square statistic 
averages 195, with median 193.8, and shows clear excess dispersion. The above moment 
estimator (2.18) for regional variability in hepatitis rates, 77, has mean 0.594. 

A fixed effects model might be adopted to allow for such variations. Here the 
parameters à; are drawn independently of each other (typically from flat gamma priors) 
without reference to an overall density. In practice, this leads to posterior estimates very 
close to the corresponding maximum likelihood estimate of the relative incidence rate 
for the ith region. These are obtained simply as 


Ri = O;/E; 


Alternatively, a hierarchical model may be adopted involving a Gamma prior G(a,b) for 
heterogeneous relative risks \;, with the parameters a and b themselves assigned flat 
prior densities confined to positive values (e.g. Gamma, exponential). So with 
A; ~ G(a, b) and 


a~ G(h, J), b ~ G(K;, Ka) 
where Jı, J2, Kı and K are known, then 
O; N Poi(A;£;) 


Here take J; = K; = 0.001 for i= 1, 2. Running three chains for 20000 iterations, 
convergence is apparent early (at under 1000 iterations) in terms of Gelman—Rubin 
statistics (Brooks and Gelman, 1998). While there is a some sampling autocorrelation in 
the parameters a and b (around 0.20 at lag 10 for both), the posterior summaries on 
these parameters are altered little by sub-sampling every tenth iterate, or by extending 
the sampling a further 10000 iterations. 

In terms of fit and estimates with this model, the posterior mean of the chi square 
statistic comparing O; and u; = d,£; is now 23, so extra-variation in relation to avail- 
able degrees of freedom is accounted for. Given that the \;’s are smoothed incidence 
ratios centred around 1, it would be anticipated that E(N) ~ 1. Accordingly, posterior 
estimates of a and b are found that are approximately equal, with a = 2.06 and b = 2.1; 
hence the variance of the A;’s is estimated at 0.574 (posterior mean of var(A)) and 0.494 
(posterior median). Comparison (Table 2.1) of the unsmoothed incidence ratios, R;, and 
the \;, shows smoothing up towards the mean greatest for regions 16, 17 and 19, each 
having the smallest total (just two) of observed cases. Smoothing is slightly less for area 
23, also with two cases, but higher expected cases (based on a larger population at risk 
than in areas 16, 17 and 19), and so more evidence for a low ‘true’ incidence rate. 

Suppose we wish to assess whether the hierarchical model improves over the 
homogenous Poisson model. On fitting the latter an average deviance of 178.2 is 
obtained or a DIC of 179.2; following Spiegelhalter et al. (2002) the AIC is obtained 
as either (a) the deviance at the posterior mean D(6) plus 2p, or (b) the mean deviance 
plus p. Comparing D and D(@) under the hierarchical model suggests an effective 
number of parameters of 18.6, since the average deviance is 119.7, but the deviance at 
the posterior mean (defined in this case by the posterior averages of the \,’s) is 101.1. 
The DIC under the gamma mixture model is 138.3, a clear gain in fit over the 
homogenous Poisson model. 
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Table 2.1 Regional relative risks: simple maximum likelihood fixed effects and Poisson-Gamma 
mixture models 


Unsmoothed Incidence ratios 2.5% Median 97.5% 
incidence ratios from hierarchical 
smoothing 

Region 1 2.66 2.42 1.62 2.39 3.34 
Region 2 1.42 1.39 0.93 1.37 1.95 
Region 3 2.92 2.77 2.07 2.75 3.59 
Region 4 1.53 1.51 1.03 1.49 2.08 
Region 5 0.71 0.75 0.44 0.74 1.15 
Region 6 1.01 1.02 0.60 1.00 1.54 
Region 7 0.61 0.69 0.30 0.66 1.23 
Region 8 1.99 1.92 1.34 1.90 2.59 
Region 9 0.89 0.91 0.55 0.90 1.35 
Region 10 0.38 0.45 0.21 0.43 0.79 
Region 11 1.31 1.31 0.95 1.30 1.72 
Region 12 0.68 0.71 0.43 0.70 1.07 
Region 13 1.75 1.62 0.95 1.59 2.52 
Region 14 0.69 0.73 0.39 0.71 1.18 
Region 15 0.91 0.94 0.49 0.91 1.53 
Region 16 0.20 0.34 0.09 0.31 0.74 
Region 17 0.18 0.32 0.08 0.29 0.70 
Region 18 0.48 0.54 0.27 0.53 0.91 
Region 19 0.38 0.55 0.14 0.51 1.17 
Region 20 0.27 0.38 0.12 0.36 0.79 
Region 21 0.54 0.59 0.31 0.58 0.95 
Region 22 0.35 0.44 0.18 0.42 0.82 
Region 23 0.15 0.28 0.07 0.25 0.62 


Example 2.2 Hot hand in baseball This example considers data on shooting percent- 
ages in baseball, as obtained by Vinnie Jones over the 1985-89 seasons, and used by 
Kass and Raftery (1995) to illustrate different approximations for Bayes factors. The 
question of interest is whether the probability of successfully shooting goals p is 
constant over games, as in simple binomial sampling (model M,), so that 


vi ~ Bin(p, ni), 


where n; are attempts. Alternatively, under M3 the hypothesis is that Vinnie Jones has 
a ‘hot hand’ — that is, he is significantly better in some games than would be apparent 
from his overall average. The latter pattern implies that p is not constant over 
games, and instead there might be extra-binomial variation, with successful shots y; 
binomial with varying probabilities p; in relation to all attempts, n; (successful or 
otherwise): 


yi ~ Bin(p;, ni) 


pi ~ Beta(a, B) RR 


Here the models (2=Beta-Binomial vs. 1=Binomial) are compared via marginal likeli- 
hood approximations based on importance sampling. 
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There are other substantive features of Jones’ play that might be consistent with a hot 
hand, such as runs of several games with success rates y;/n; larger than expected under 
the simple binomial. Here, rather than global model hypothesis tests, a posterior 
predictive check approach might be used under the simple binomial sampling model. 
This entails using different test statistics applied to the observed and replicate data, y 
and ynew, and preferably statistics that are sensible in the context of application. In 
particular, Berkhof et al. (2000) test whether the maximum success rate max;{Vjnew/Ni} 
in the replicate data samples exceeds the observed maximum. 

In the binomial model, the prior p ~ B(1, 1) is adopted, and one may estimate the 
beta posterior density B(a,, b,) of p using moment estimates of the parameters. Thus, if 
kp is the posterior mean of p and V, its posterior variance, then H = a, bp is estimated 
as [kp(1 — kp) — Vp]/ Vp. Thus with k, = 0.457 and V9? = 0.007136, a posterior beta 
density with ‘sample size’ H = 4872 is obtained. 

In the beta-binomial, Kass and Raftery (1995, p. 786) suggest reparameterising the 
beta mixture parameters in Equation (2.19). Thus a = v/w, B = (1 — v)/w, where both 
œw and v are assigned B(1, 1) priors — equivalent to uniform priors on (0, 1). The 
posterior beta densities of the p;’s in Equation (2.19) are approximated using the 
moment estimation procedure, and similarly for the posterior beta densities of w and v. 

It may be noted that the beta-binomial model is not especially well identified, and 
other possible priors such as vague gamma priors on the beta mixture parameters 
themselves, e.g. 


a ~ G(0.001, 0.001), B ~ G(0.001, 0.001) 


have identifiability problems. With the reparameterised version of the model conver- 
gence for w is obtained from the second half of a three chain run with 10 000 iterations, 
with posterior mean of œ at 0.0024 and posterior standard deviation 0.002. 

In a second run, iterations subsequent to convergence (i.e. after iteration 10000) 
record the prior, likelihood and (approximate) posterior density values as in Equation 
(2.9), corresponding to the sampled parameters of the binomial and beta-binomial 
models, namely {p} and {p;, w, v}. Then with 1000 sampled parameter values and 
corresponding values of 8 = log(g) — [log(L®) log (m™)], the approximate mar- 
ginal likelihoods under models 2 and 1 are —732.7 and —729.1, respectively. This leads 
to a Bayes factor in favour of the simple binomial of around 35. 

Table 2.2, by contrast, shows that the beta-binomial has a slightly higher likelihood 
than the binomial. The worse marginal likelihood says in simple terms that the im- 
proved sampling likelihood obtained by the beta-binomial is not sufficient to offset the 
extra parameters it involves. Kass and Raftery (1995, p. 786) cite Bayes factors on Mı 
between 19 and 62, depending on the approximation employed. 

Features of the game pattern such as highest and lowest success rates, or runs of ‘cold’ 
or ‘hot’ games (runs of games with consistent below or above average scoring) may or may 
not be consistent with the global model test based on the marginal likelihood. Thus, 
consider a predictive check for the maximum shooting success rate under the simple 
binomial, remembering that the observed maximum among the y;/n; is 0.9. The criterion 


Pr( max {Ynew/n} > max {y/n}) 


is found to be about 0.90 — this compares to 0.89 cited by Berkhof et al. (2000, p. 345). 
This ‘significance rate’ is approaching the thresholds which might throw doubt on the 
simple binomial, but Berkhof et al. conclude that is still such as to indicate that the 
observed maximum is not unusual or outlying. 
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Table 2.2 Posterior summary, baseball goals, binomial and beta-binomial parameters 


Beta-binomial Mean St. devn. 2.50% Median 97.50% 


w 0.00241 0.00205 0.00009 0.00193 0.00773 
a 683.5 1599 58.49 236.6 5159 

B 808.1 1878 70.61 282.7 6004 

Log likelihood —721.3 4.416 —727.5 —722.1 —711.4 
SD(p) in beta-binomial 0.013 

Binomial 

p 0.457 0.007 0.441 0.457 0.474 
Log likelihood —725.6 0.7 —728.3 —725.3 —725.1 


Example 2.3 Cycles to conception Weinberg and Gladen (1986) consider differences 
in the number of fertility cycles to conception according to whether the woman in each 
couple smoked or not. For i= 1, .., 100 women smokers, 29 conceived in the first 
cycle, but from 486 non-smokers, 198 (or over 40%) conceived in this cycle. The full 
data, given in Table 2.3, consist of the number y of cycles required according to smoking 
status, with the last row relating to couples needing over 12 cycles. 


Such an outcome is a form of waiting time till a single event, but in discrete time units 
only, and can be modelled as a geometric density. This is a variant of the negative 


binomial (see Chapter 1) which counts time intervals y until r events occur, when the 
success rate for an event is p. The negative binomial has the form 


Pr(y) = Cija — py" (2.20) 


and it follows that the number of intervals until r = 1 (e.g. cycles to the single event, 
conception) is 


Table 2.3 Cycles to conception 


Cycle Non-smokers Cumulative Smokers Cumulative 
proportion proportion 
conceiving conceiving 

1 198 0.41 29 0.29 
2 107 0.63 16 0.45 
3 55 0.74 17 0.62 
4 38 0.82 4 0.66 
5 18 0.86 3 0.69 
6 22 0.90 9 0.78 
7 7 0.92 4 0.82 
8 9 0.93 5 0.87 
9 5 0.94 1 0.88 

10 3 0.95 1 0.89 

11 6 0.96 1 0.9 

12 6 0.98 3 0.93 

Over 12 12 1.00 7 1 


Total 486 100 
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Pr(y) = (1— py"! p 
So p is equivalently the chance of conception at the first cycle (when y = 0). 

Consider first a constant probability model (Model 1) for couples within the smoking 
group and within the non-smoking group, so that there are just two probabilities to 
estimate, pı for smokers and pz for non-smokers. Under this density, the probability of 
more than N cycles being required is 


Priy>N)=p >) Q-p™'=0-p" (2.21) 
i=N 1 
In the present example, N = 12. Note that this is an example of censoring (non- 
observation) of the actual cycles to conception; only the minimum possible cycle 
number for such couples is known. 

The likelihood (2.20) and (2.21) may be modelled via the non-standard density option 
available in BUGS (the dnegbin option in BUGS might also be used but is complicated by 
the censoring). Thus, for a density not available to sample from in BUGS, an artificial data 
series Z; of the same length Nas the actual data is created, with Z; = 1 for all cases. Then, if 


C; are the number of conceptions at cycle i, i= 1, .., N, and if there are no groups to 
consider (such as the smoking and non-smoking groups here), the density for Z is 
Z; ~ Bern(L;) 


where L; is the likelihood defined by 
L; = [(1 -p pl 


where Bern() denotes Bernoulli sampling. The corresponding coding in BUGS (with 
N = 12 and a B(1, 1) prior for p) is 


{for (i in 1:N) {Z[i] <- 1 
Z[i] ~ dbern(L[i]) 
L[i] <- pow(r[i],C[i]) 
r[i] <- pow (1-p,i-1)*p} 
p ~ dbeta(1, 1)}. 


In the example here, this approach is extended so that at each cycle i there are J groups 
of cases (here J = 2 for smoking and non-smoking women), so that 


Z; ~ Bern(L;) i=1,.., N; j=1,..,J 
Ly =[0 — py! pl 


Alternatively (Model 2), one might allow for extra-variability in conception chances 
with a beta mixture at the couple level. This is known as a beta-geometric mixture, and 
is analogous to the more commonly encountered beta-binomial. There are now distinct 
probabilities piy for couples k = 1, ... C; taking i cycles to conceive (with i= 1, .. N). 
Disregarding possible grouping, and continuing the artificial data device, the coding 
corresponds to the model 


Zik = 1 


Zir ~ Bern(Liz 

: ce (2.22) 
Lik = (1 — pix)” Pik 
peN Blok Bod =a SO 
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with the total likelihood for subjects taking i cycles being 


Ci 
Li=]| Le 
k=1 


Program 2.3 includes this approach with the data arranged as a single string of length 
UC). 

Under the constant probability model (within the two groups), the posterior means 
for the conception probabilities are pı = 0.23 and p2 = 0.33, with 95% credible intervals 
(0.19, 0.27) and (0.31, 0.36). The predicted cycle distribution under this model, denoted 
nhat[,] in Program 2.3, under-predicts both short and long cycles to conception frequen- 
cies (1 cycle and over 12 cycles) — see Table 2.4. This brings into doubt the constant 
probability model. 

For the heterogeneous case, with a mixture of conception probabilities over couples, 
attention is confined to couples where the female partner smokes. The data are then 
defined as a vector of 100 couple level observations of cycles required, with 29 observa- 
tions requiring one cycle, 16 requiring two and so on, with the couples needing over 12 
cycles coded to 13. As well as the artificial data device, a reparameterisation as in 
Example 2.2 is employed for the beta mixture parameters. 

A three chain run taken to 10000 iterations shows apparent convergence from 1000 
iterations (the beta mixture parameters are relatively slow to converge according to 
Gelman—Rubin criteria). The average conception probability from the beta parameters, 
namely 


P=a/(a_ P) 


is 0.29 with 95% credible interval from 0.22 to 0.37. Short and long cycles to conception 
frequencies are much more closely reproduced under this model (Table 2.5). The 
credible interval for these frequencies includes the actual frequencies of 29 and 7, 
respectively. The benefits from allowing conception chances to vary over couples are 
evident. 


Table 2.4 Cycles to conception, predictions under constant probability model 


Smokers Non-Smokers 
Cycle Mean 2.5% 97.5% Mean 2.5% 97.5% 
1 22.6 18.7 26.7 161.3 149.8 172.9 
2 17.4 15.2 19.6 107.7 103.6 111.4 
3 13.5 12.4 14.4 71.9 71.6 72.0 
4 10.4 10.0 10.6 48.1 46.2 49.6 
5 8.1 TaT, 8.2 32.1 29.8 34.3 
6 6.3 5.6 6.6 21.5 19.2 23.7 
7 4.8 4.1 5.4 14.4 12.4 16.4 
8 3.8 3.0 4.4 9.6 8.0 11.4 
9 2.9 2,2 3.6 6.4 5.1 7.9 
10 2.3 1.6 2.9 4.3 3.3 5.4 
11 1.8 1.2 2.4 2.9 2.1 3.8 
12 1.4 0.9 1.9 1.9 1.4 2.6 


Over 12 4.9 2.4 8.4 3.9 235 5.8 
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Table 2.5 Couples with female partner as smokers: predicted cycle 
distribution under mixture model 


Mean St. devn. 2.50% 97.50% 
P 0.29 0.04 0.22 0.37 
a 3.36 3.75 1.16 11.61 
B 9.23 13.33 2.25 36.77 
Cycle 
1 28.9 3.4 22.5 35.6 
2 18.2 0.9 16.4 19.8 
3 12.2 0.8 10.6 13.6 
4 8.6 0.8 7.1 10.1 
5 6.3 0.7 5.0 FT 
6 4.7 0.6 3.7 5.9 
7 3.6 0.5 2.8 4.6 
8 2.8 0.4 2.1 3.6 
9 2.2 0.3 1.7 2.9 
10 1.8 0.2 1.4 2.3 
11 1.4 0.2 1.1 1.9 
12 1.2 0.2 0.9 1.5 
Over 12 8.2 22 4.2 12.9 


In practice, couple characteristics (e.g. the woman’s age) might be expected to affect 
the chance of conception, and one might then link the p;x’s in Equation (2.22) to 
predictors via a logit regression. Remaining variability between couples might then be 
modelled via a Normal error in the logit link. In fact, this approach might be adopted 
for all 586 couples using the female smoking status Sip (=1 for smokers, 0 otherwise) as 
a predictor, so that 


logit(pix) = Yo ViSik Uik 


with uz Normal. 


2.3.2 Smoothing methods for continuous data 


For metric data, assumed initially at least to be approximately Normal, a typical 
problem involves two way data with i= 1, .., n; replicated observations y; within 
groups j=1,.., K. Sometimes the observed data may be provided only as group 
averages y; aggregating over individual observations, though with details on the vari- 
ability within groups (or on 95% intervals for the mean). Assuming the observed means 
are derived from similar observation settings or similar types of unit, they may be 
regarded as draws from an underlying common density for the unknown true means 
p.;- This assumption leads to a hierarchical model, with the first stage specifying the 
density of the observations, and the second (and maybe higher) stages specifying 
hyperparameters which underlie the observations. Typically, the underlying cell means 
w are taken to differ by group but the variance o? (and so also the precision t = 07°) is 
assumed constant over groups. In analysis of variance situations, the goal may be to 
assess additionally whether the underlying group means are equal. For data in a one 
group y;, i= 1, .., n, the higher stage density involves a single mean and variance. 
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While a symmetric, unimodal density such as the Normal is often appropriate, some 
work has focused on skewed options to the Normal (Leonard, 1980; Fernandez and 
Steel, 1998), which may involve a different variance according to whether the observa- 
tion is located above or below the mean. Other options model skewness via latent ‘factor 
scores’ introduced into the prediction of the regression mean (Branco and Dey, 2001). 
Other departures from symmetry or unimodality may be modelled by discrete mixtures 
(see Section 2.4). 

Even if the assumption of symmetry is accepted, other features of the Normal density 
may be inappropriate. There are issues of robustness to outlier observations or param- 
eters. The exceptional observation(s) may indicate a robust (heavy-tailed) alternative to 
the Normal so that estimates of summary parameters at the second stage are not 
distorted, while an exceptional parameter (e.g. one of the means p) will indicate an 
alternative second stage prior to the Normal (Gelman et al., 1995, Chapter 12). 

However, suppose initially the observations are assumed to be Normally distributed 
such that 


Equivalently, the likelihood can be specified in terms of the means of the observations 
within group j, namely y;, also with means p, but with variances o” /n; (and precisions 
njt, where t = 1/07), so that 


Jile; ~ N(wj, 07/1) (2.23) 


Often — for example in meta analyses — the observed data will be available in the form of 
summary statistics y; such as odds ratios, with sample sizes and sampling variances V; 
for the summary statistics also provided. It may be noted that a Normal approximation 
to the distribution of the means or other summary measures may remain appropriate, 
even if the original data are non-Normal. Thus the y; might be the logged odds ratios 
resulting from 2 x 2 tables of case versus exposure status, and the variances V; would 
have been obtained by the usual formula in such applications (Woodward, 1999). 

Suppose the second stage prior for the p,’s is also assumed as Normal with mean 
M and variance p°. At the highest stage a gamma prior for the inverse variance 1/7 
might be adopted, and a flat (though proper) prior for the mean. Inferences about the p; 
are likely to be sensitive to the prior on the higher level variance or precision: to avoid 
over-smoothing, the prior for 1/b? may be specified to avoid high precision (see 
Example 2.6). 

Inferences for data assumed Normal, for instance regarding cell means in one or two- 
way analysis of variance situations, may be influenced by discordant observations. In 
these cases the heavier tailed alternatives to the univariate and multivariate Normal, 
such as the Student ¢ in its univariate and multivariate forms (with additional degrees of 
freedom parameter, v) may provide more robust inferences with regard to the location 
of the overall mean. Small values of v (under 10) indicate that Normality of the data is 
doubtful, while values in excess of 50 are essentially equivalent to Normality. It may 
also be necessary (see Example 2.5) to account for outlier parameters at the second or 
higher stages of a hierarchical model, such as a discrepant p,. Heavy tailed departures 
from non-Normality may also be modelled by exponential power distributions, of which 
the double exponential is an example (Box and Tiao, 1973, Chapter 3). 

As noted in Chapter 1, the Student ¢ density is obtainable as a scale mixture of the 
Normal. Thus 
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yi~N(w’?/d) i=l,..,n 


with the variance scaling parameters or weights \; being drawn from a G(v/2, v/2) 
density. If additionally v is treated as an unknown parameter, then a flat prior is not 
appropriate (Geweke, 1993), and an exponential prior is one option; also for the 
posterior mean of wp to exist, values of v € [0, 2] are not admissible. The scaling 
parameters are lowest for observations which are discrepant (i.e are potential outliers) 
from the main set; the scale mixture approach to outlier detection is exemplified by 
authors such as West (1984) and DeFinetti (1961). 

In fact, specifying the ,’s in this form is one option in a broader class of scale mixture 
models in which the prior for the weights does not necessarily refer to the degrees of 
freedom parameter (Geweke, 1993, p. S23). In general, 


yi ~ N(p, 0° /wi) (2.24) 


where w; are positive random variables with average 1 representing variance heterogen- 
eity. Markedly lower values of w; correspond to suspect values of y, only included in the 
density in Equation (2.24) by inflating the variance to accommodate them. 


Example 2.4 Univariate Scale Mixture: the Darwin data An often cited example of 
distortion in the univariate Normal parameters caused by potentially outlying points is 
the data obtained by Darwin relating to the benefits of cross-fertilisation in plants. 
Specifically, he noted differences in heights (in eighths of an inch) within pairs of corn 
plants, one cross-fertilised and the other self-fertilised. Here variations on the first stage 
density are introduced, and their impact assessed on the summary measures (hyper- 
parameters) of location and variability. 


The data y; are differences from 15 plant pairs, as follows: 
49, —67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, —48 
Suppose the first stage density for these data is taken as Normal. A run of 10000 
iterations with 1000 burn in provides a posterior mean (95% credible interval) for the 
Normal mean p of 20.9 (1.6, 40.1) and a posterior median on øo? of 1300. The mean is 
thus imprecisely estimated, though its 95% credible interval is confined to positive 
values. 

An alternative sampling density is a univariate® Student t, y; ~ t,(u, o°), without an 
explicit scale mixture (Model 2 in Program 2.4), one aspect of which is to assess the 
density of the degrees of freedom parameter v. Here an exponential prior for v is 
assumed, v ~ E(k), with x itself assigned a uniform prior U(0.01, 0.5). This is approxi- 
mately equivalent to assuming the degrees of freedom lies within the range 2-100. 
Assigning a preset value such as k = 0.1 would be a more informative option. There 
is also a formal constraint on v excluding the range v € [0, 2]. A flat gamma prior for 
t = 1/o° is adopted, namely t ~ G(0.0001, 0.0001), and an N(0, 10’) prior on p. One 
set of initial values for a three chain run is based on null start values, the others on the 
mean and 97.5% point of a trial run. 

Early convergence of k is apparent, but v itself only converges after about 150000 
iterations. Table 2.6 is based on iterations 150 000-200 000. A median value of 10 for the 


€ In BUGS this involves the code 
yli] ~ dt(mu, tau, nu) 


with tau as precision, and nu as degrees of freedom. 
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Table 2.6 Student ¢ parameters for fertilisation data, degrees of freedom unknown 


Parameter Mean St. devn. 2.5% Median 97.5% 
K 0.11 0.12 0.01 0.06 0.47 
Mean 23.90 9.84 —1.06 24.17 46.72 
v 29.57 52.66 2.07 10.17 289.30 
Variance 1225.00 709.20 237.10 1088.00 3655.00 


degrees of freedom is obtained (though the posterior mean for v is 30). The posterior 
median of ø? is lower than in the Normal analysis, namely 1090, and the Normal mean 
is higher — though not more precisely identified — with an average of 24. 

This analysis illustrates how inferences about the second stage parameters may be 
distorted by incorrectly assuming a Normal first stage density, but does not indicate 
which observations are the source of the heavy tails. In a third analysis (Model 3 in 
Program 2.4), the scale mixture version of the Normal is applied, with v still an 
unknown. Taking the prior on the degrees of freedom as above we obtain — from 
iterations 50000-100000 of a three chain run — factors \; for each sample member 
that provide a measure of outlier status. It may be noted that sampling convergence is 
improved by this approach as compared to the standard Student ¢ form above. The 
lowest A; (i.e. most suspect points) are for corn plant pairs 2 and 15, the observations for 
which are both highly negative (Table 2.7). The high positive value for pair 13 is also 
suspect. The Normal mean parameter p is again estimated to have posterior mean 24, 
and the median variance is around 1050. 


Table 2.7 Parameters for scale mixture density of fertilisation data 


Mean St. devn. 2.5% Median 97.5% 
Mean 24.1 9.8 3.9 24.4 42.8 
Degr. Freedom 23.7 40.8 22 8.7 140.8 
Variance 1199 706 304 1058 2950 


Weights for Individual Cases 


Mean St. devn. 2.5% Median 97.5% 
NM 1.03 0.53 0.22 0.97 2.37 
A2 0.59 0.36 0.05 0.57 1.29 
A3 1.06 0.56 0.24 0.99 2.47 
Ng 1.11 0.60 0.26 1.01 2.68 
Ns 1.05 0.55 0.23 0.98 2.43 
N6 1.14 0.63 0.27 1.02 2.78 
A7 1.14 0.62 0.28 1.02 2.78 
dg 1.09 0.58 0.24 1.00 2.57 
No 1.10 0.60 0.25 1.01 2.64 
N10 1.14 0.63 0.27 1.02 2.78 
Ni 0.97 0.49 0.19 0.93 2.17 
Ny 1.14 0.62 0.27 1.03 2.78 
AB 0.82 0.42 0.12 0.83 1.74 
N14 0.94 0.47 0.17 0.91 2.07 


N15 0.68 0.38 0.07 0.68 1.45 
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Example 2.5 Labour market rateable values As an example of the impact of adopting 
a multivariate ¢ density at the first or higher stage, consider data on four types 
of Rateable Value (RV) change in four groups of Labour Market Area in England 
and Wales over 1966-71 (Kennett, 1983, Table 9.4). The n=4 Labour Market 
groups in this example are either cores, rings, outer rings, or unclassified, with the 
p = 4 variables being the change in per cent in total RV, in domestic RV, in commercial 
RV and in industrial RV. These data contain extreme values on industrial RV change, 
which cast doubt on Normality and raise problems in estimating the mean for this 
variable. 


It may be noted that in BUGS, the direct form of the multivariate ¢ (i.e. not based on 
scale mixing) for a p-dimensional outcome y; involves the command 


yl, 1: p] ~ dmt(muf[1: p], P[, ], nu), 


where nu (the degrees of freedom) is usually’ preset (Model A in Program 2.5). A profile 
of the log-likelihood or some predictive measure like the expected predictive deviance 
over different values of the degrees of freedom is then one approach to assessing both 
the extent of departures from multivariate normality and the most appropriate degrees 
of freedom. For illustration, a multivariate ¢ likelihood at the first stage is compared 
between two options for the degrees of freedom, v = 20 and v = 100 (the latter effect- 
ively equivalent to multivariate normality). There is, in fact, very little to choose 
between these models in terms of likelihood. With runs of 50000 iterations, median 
log-likelihoods are —53.7 for v = 100 and —53.9 for v = 20. The posterior for industrial 
RV change when v = 20 has 95% credible interval from —29 to 78, with mean 25. 

One may obtain outlier indicators à; by specifying a univariate Normal measurement 
error model at the first stage, and a multivariate ¢ (via scale mixing) in the second stage 
model for the means (Model 2 in Program 2.5): 


Yj NG) i=1,.. nh j=1,..,p 
Bg = Ej 

Ej = Ny/ Vi 

hi ~ G(v/2, v/2) 

Ni, tp ~ Np(0, È) 


where y; are assigned a just proper Normal prior. The variances 1; specify the expected 
level of measurement error, and one might assume relatively informative priors 
favouring lower error variances. The p; are then the underlying multivariate form of 
the data. Conclusions about $, and the outlier indicators may be sensitive to the 
assumed qj, especially for small samples as here. As above the degrees of freedom is 
assigned an exponential prior E(k), excluding values under 2, and with x itself a free 
parameter with prior kx ~ U(0.01, 0.5). 

It may be noted that this approach might be used with joint discrete outcomes subject 
to extra-variation, where such extra-variation was suspected to be located in relatively 
few observations. For example, if Yj were pairs of Poisson outcomes such as deaths 
from cause A and from cause B, with expectations (in the demographic sense) £,, then 
one might assume 


7 Though a discrete prior on different values of d.f. is possible. 
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Yy ~ Poilu; Ei) 
log (M) =Y Ej 


with the ¢j’s specified as above. 

In the rateable values example, three chains are run for 10000 iterations from over- 
dispersed initial values (two sets of which are based on a trial run, the other being a 
‘null’ start point) with convergence after 2500 iterations. Informative priors on the 
inverse variances of the measurement error are based on such error accounting for 
approximately 10% of the total variance® of each variable. The outlier indicators show 
in fact that the core areas have lowest weight (around 0.55), though the posterior 
distributions of all the \;’s have substantial spread. The degrees of freedom has median 
4. Industrial RV change is more precisely identified under this mixture model approach, 
with 95% interval 2 to 52, and mean 28. Of course, other priors on the measurement 
error variances might well be applied to assess sensitivity. 


Example 2.6 Coaching programs In this example we consider a cell means analysis of 
a metric outcome, defined by the effects y; of coaching programs for the Scholastic 
Aptitude Test-Verbal in K = 8 US high schools. The data are available as the estimated 
coaching effects and their standard errors s; after an analysis of covariance using scores 
from an earlier test administration. The goal is to estimate underlying means pw;, the 
prior for which expresses a belief that such means are drawn from a common distribu- 
tion. Assume the effects y; are draws from a hierarchical model with the first stage 
sampling variances V; provided by the squared standard errors s, so that 


with the second stage prior specifying a common Normal distribution 
w; ~ N(M, 4”). 


As one route to establish a prior on ¢ that avoids over-smoothing of the p; consider the 
harmonic mean of the observed V;, given by 


K 
w5 uv, 
j=1 


J 


which is s = 139, with square root sọ = 11.8. This mean is the basis for a Pareto prior 
density for , as proposed by DuMouchel (1990), involving so as a guide quantity in the 
prior. Specifically, the prior 


mb) = so/(so by (2.25) 


is still relatively vague but has median so. 
Alternatively, a prior may be set on ¢” or its inverse @ 7, which incorporates (though 
downweights) the guide value sĝ or its inverse s)*. Suppose a prior is set as follows: 


b * ~ Ga, b) 


where a = 1/k and b = s4/«, where k is a downweighting constant in excess of 1. Large 
values of k (e.g. k = 1000) correspond to just proper priors. 


8 The moment variances V; are 11,22,6 and 800, and G(10, V;) priors on the inverse variances of the 
measurement errors are adopted. 
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With an arbitrary value of k = 10, a three chain run (Model A in Program 2.6) shows 
early convergence and summaries are based on the last 24 500 iterations from 25000. 
This leads to a posterior mean value for ° of 62.5, and to smoothed program effects bj 


ranging from 5.3 to 11.7, as represented in terms of posterior means (see Table 2.8). 
Then (Model B) uncertainty in k is introduced via an exponential prior, 


k ~ E(0.01) 


Table 2.8 Program effects: observed and smoothed 


School Effectiveness Data 


School Y s s? 1/s? 
A 28.39 14.9 222 0.0045 
B 7.94 10.2 104 0.0096 
C —2.75 16.3 265.7 0.0038 
D 6.82 11 121 0.0083 
E —0.64 9.4 88.4 0.0113 
F 0.63 11.4 130 0.0077 
G 18.01 10.4 108.2 0.0092 
H 12.16 17.6 309.8 0.0032 
Sum of 1/s? 0.0576 
Harmonic Mean 138.81 = (8/0.0576) 
for Variance 
Smoothed Effects Model A Model B 
Mean 2.5% 97.5% Mean 2.5% 97.5% 
by 11.7 —5.3 34.8 by 11.4 —4.8 34.8 
by 8.0 —7.7 23.8 [bo 8.0 —7.5 23.5 
u3 6.4 —14.6 24.4 u3 6.5 —14.2 24.0 
bg Teh —8.4 23.8 M4 7.7 —8.5 23.5 
bbs 53  —11.4 19.5 [bs 5.5 —11.3 19.5 
Le 6.2 —I11.5 21.8 Le 6.3 —11.8 21.8 
W7 11.0 —3.9 29.3 W7 10.6 —3.9 28.9 
Ug 8.7 —10.4 29.1 Mg 8.5 —10.3 29.0 
p° 62.6 4.3 4129 b’ 58.5 0.9 423.0 
M 8.1 —3.7 20.2 M 8.1 —3.8 20.3 
K 47.3 0.2 304.8 
Model C 
Mean 2.50% 97.50% 
by 10.19 —1.11 26.00 
[by 8.15  —2.73 19.68 
u3 7.20 —6.90 19.65 
bg 8.05 —3.44 19.85 
bbs 6.61 —5.88 17.48 
Le 7.20  —5.11 18.74 
u7 9.72 —0.64 22.44 
Ug 8.60  —4.06 22.52 
p? 4.23 0.06 14.15 
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A three chain run shows convergence for this parameter at around 2500 iterations, and 
from iterations 2500-25000 a median k of 24 and mean 47 are obtained, with the 
posterior mean for °? now 58.5. The posterior density for k has most mass on low 
values but has a positive skew. The smoothed effects are slightly more pulled to the 
grand mean than under Model A, and range from 5.5 to 11.4. Finally, Model C is 
defined via the Pareto prior in Equation (2.25). Note that initial values in three chains 
for the Pareto variable (in its BUGS parameterisation and denoted tau0 in Program 2.6) 
are randomly sampled from the prior. This model, after a run of 25 000 iterations, leads 
to greater smoothing with a posterior mean for ° now 32 and the program effects 
ranging from 6.4 to 10.0. 

Gelman et al. (1995, p. 147) cite a range in posterior median effects from 5-10 for 
these data. They conclude that one program effect, namely the first program with mean 
28.4 and empirical standard error 14.9, is unlikely in terms of the smoothing density 
m; ~ NM, *). Thus, in Model A one may compare yj with p; using the standard 
normal cumulative distribution function ®, and find that the median chance of a w; 
smaller than 28.4 is 0.998. 


2.4 DISCRETE MIXTURES AND DIRICHLET PROCESSES 


The priors considered above for the underlying population mixing density have 
a specific parametric form. However, to avoid being tied to particular parametric 
forms of prior, non-parametric options have been proposed, such as discrete mixtures 
of parametric densities (Laird, 1982) or Dirichlet process priors. The goal in choosing 
parametric or non-parametric mixing may be to improve fit and robustness of inferences 
by approximating more closely the true density of the sample (Robert, 1996; West, 
1992). This is especially so for observations that are multimodal or asymmetric in form. 
Here a parametric model assuming a single component population is likely to 
be implausible. 

For instance, in smoothing health outcomes over sets of small areas, especially when 
there may be different modes in subsets of areas, discrete mixtures may be used (Clayton 
and Kaldor, 1987). The mixture will generally involve a small but unknown number of 
sub-populations, as determined by an underlying latent set of group membership 
variables for each observation. There are then likely to be issues of discrimination 
between models involving different numbers of sub-populations, especially in smaller 
samples or if data likelihoods are relatively flat (see Bohning, 2000). 


2.4.1 Discrete parametric mixtures 


Discrete mixture models have wide flexibility in representing heterogeneous data, 
when a choice of parametric form for the heterogeneity is unclear or when inferences 
are sensitive to particular choices of parametric mixture. Assume a mixture of J sub- 
populations or groups, and let L; denote a latent group membership indicator for 
sample member i; L; can take any value between 1 and J. The latent data may also be 
expressed by multinomial indicators Z;=1 if L;=j and Zj;—=0 otherwise. If 
the missing data L; were actually known, the density of the data Y = (y1, .., Yn) 
could be written 


yilLi = j ~ fO%|8)) 
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The densities conditional on L are generally taken to have the same form, e.g. all normal 
or all Poisson, with the parameters of groups j being denoted 0;. For a Poisson the 0;’s 
would be means and for a Normal mixture the 6;’s could be group specific means and 
variances {m;, 6;}, varying means and a common variance {p;, >} or a common mean 
but differing variances {p, 6;}. The marginal density of y is then 


J 
fO)= 5 Silba; (2.26) 
j=l 


where 7; is the probability of belonging to sub-population j, and 


af 
j=! 


In practice the L; are unknown and the estimation problem relates to the mixing propor- 
tions 7, the parameters of the separate sub-populations, and the posterior estimates of 
group membership L; given the data and parameters. Typically, the vector 7 is assigned a 
Dirichlet prior with elements {a ),...,a;} with a typical choice being such as 
a, =.. = ayj = l; the posterior for m is then provided by a Dirichlet with elements 
K; «aj, where Kj is the number of sample members assigned to the jth group. The 
probabilities governing the assignment of subjects i to groups j are specified by 


Pj = Pr(Zq = 1) = mf 0:10) /f 0) (2.27) 


where f(y;) is as in Equation (2.26). If means differ by sub-population, then the 
smoothed mean p; for the ith subject is the sum over j of the population means p; 
times the assignment probabilities 


p: =X Pym, (2.28) 
J 


These means for subjects will often show shrinkage towards the global mean 
À= TIM WW. ..T1yp, (e.g. in disease mapping applications), even if the subjects 
have high posterior probabilities of belonging to groups with means much above or 
below the global mean. 

The major issues in identifying mixture models using parametric densities f ( y|0;) are 
the general question of identifiability in the face of possibly flat likelihoods (Bohning, 
2000), and the specification of appropriate priors that are objective, but also effective in 
estimation. Thus, Wasserman (2000) cites the hindrance in mixture modelling arising 
from the fact that improper priors yield improper posteriors. More generally, vague 
priors even if proper, may lead to poorly identified posterior solutions, especially for 
small samples. Various approaches to prior specification in mixture modelling have 
been proposed and often mildly informative proper priors based on subject matter 
knowledge may be employed. 

There are also issues of ‘label switching’ in MCMC estimation of mixture models. If 
sampling takes place from an unconstrained prior with J groups then the parameter 
space has J! subspaces corresponding to different ways of labelling the states. In an 
MCMC run on an unconstrained prior there may be jumps between these subspaces. 
Constraints may be imposed to ensure that components do not ‘flip over’ during 
estimation. One may specify that one mixture probability is always greater than an- 
other, or that means are ordered, u; > p >... > py, or that variances are ordered. It 
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remains problematic whether such constraints distort the final estimate. Depending on 
the problem, one sort of constraint may be more appropriate to a particular data set: 
constraining means may not be effective if subgroups with different means are not well 
identified, but groups with different variances are (Fruhwirth—Schattner, 2001). A 
sophisticated constraint scheme is proposed by Robert (1996) which for continuous 
Normal or Student data encompasses both mean and variance: for a two group model, 
this takes the form 


Y~pN(p, 0?) (1—p)N(p o9, 7°07) 


There is also the possibility of empty groups (e.g. only J — 1 groups of cases are chosen 
at a particular MCMC iteration when a J group model is being fitted); to avoid this, 
Wasserman (2000) proposes a data dependent prior that avoids a null group occurring. 
Chen et al. (2001) seek to overcome boundary and identifiability problems by introdu- 
cing an extra penalty component to the likelihood that amounts to introducing JC extra 
‘observations’, C with mean p,, C with mean w,, and so on. As an example, let there be 
J = 2 sub-groups with probabilities 7 and (1 — 7), differing only in their means p; and 
by, With p; < p. The number 2C of extra observations is related to the precision on the 
prior for the p's. Thus if 1, p are uniform with priors U(—R, R) then Chen et al. 
take C = log (R). For a normal prior on the p;, one might take C = log (30), where g? is 
the prior variance of the p;. The likelihood? for J = 2 is penalised as follows: 


Yilogl(1—7) file) Tafil] Clog [4m — 7)] 


This approach generalises to J > 2 sub-groups, with the same inequality constraint on 
the p;’s, and with a penalty CEZ; log (27). 


2.4.2 DPP priors 


Dirichlet Process Priors (abbreviated as DPP priors) offer another approach avoiding 
parametric assumptions and, unlike a mixture of parametric densities, are less impeded 
by uncertainty about the appropriate number of sub-groups (see Dey et al., 1999). The 
DPP method deals with possible clustering in the data without trying to specify the 
number of clusters, except perhaps a maximum conceivable number. Let y;, i = 1, .., n, 
be drawn from a distribution with unknown parameters v;, Ẹ;, 
SOilvi i) 

where a Dirichlet process prior is adopted for the v;’s but a conventional parametric 
prior for the g,’s. The Dirichlet process specifies a baseline prior Go from which 
candidate values for v; are drawn. For instance, in Example 2.1 most of the incidence 
ratios \; are likely to be between 0.25 and 4 in a hierarchical model, so a suitable 
baseline prior Go on the v; = log (A;) might be a N(0, 0.5) density. 

Suppose the v;’s are unknown means or log means for each case, and that clustering in 
these values is expected. Then for similar groups of cases within a cluster, the same value 
6 of v; would be appropriate for them. Theoretically, the maximum number M of 
clusters could be n, but it will usually be much less. So a set of M potential values of 
60, denoted 6,,, m = 1, .., M, is drawn from Go. The most appropriate value O, for case 


? In BUGS, for J = 2 components in a normal mixture with equal variance across groups, and C rounded to 
an integer, the prior for the additional points might be specified with the coding: 

for (i in 1:C) {y.aux[i] ~ dnorm(mu[1],tau); y.aux[i C] ~ dnorm(mu[2],tau)} 

for (i in 1:n) {y[i] ~ dnorm(mu[L{i]],tau); L[i] ~ deat(pif[1:2])} 
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i is selected using multinomial sampling with M groups, and the Dirichlet prior on the 
group probabilities has a precision parameter a. So the cluster indicator for case i is 
chosen according to 


L; ~ Categorical(p) 


where p is of length M, and has uniform elements determined by a precision parameter 
a. Then case į is assigned the parameter 0z,. The parameter a may be preset (typical 
values are a = 1 or a = 5), or itself assigned a prior. Following Sethuraman (1994) and 
Ishwaran and James (2001), one may generate the mixture parameters by considering 
the v,’s as iid with density function q, where 


q) =X pj hO) (2.29) 
j=l 
where A is the density of the 6; under Go. In practice, the mixture may be truncated at 


M < n components with oe pial. 
The random mixture weights p; at any iteration may be constructed by defining a 


sequence r1, 2, .-, rm-1 Of Beta(1, œ) random variables, with ry = 1. Thus, sample 
r;~ Bl,a) j=1,.., M (2.30) 
and set 
A=!" 
p=n(l-ri) 


p3 =r3(l — r) — rı) 


and so on. This is known as a stick-breaking prior, since at each stage what is left of a stick 
of unit length is broken and the length of the broken portion assigned to the current value 
p;. At any iteration it is likely that some of the M potential clusters will be empty and one 
may monitor the actual number J of non-empty, though differing, clusters. Other 
sampling strategies for the r; are discussed by Ishwaran and James (2001); one option 
takes the random weights as r; ~ B(a;, bj), where a; = 1 — a, bj =jaand0<a< 1. 


Example 2.7 SIDS deaths Consider the well known data on SIDS deaths S; among 
births B; in 100 counties of South Carolina during 1974-78. The statewide SIDS death 
rate per 1000 births is 2 per 1000 (667 deaths in 329 962 births), but some counties have 
(unsmoothed) rates approaching 10 per 1000. Symonds et al. (1983) consider a two 
group Poisson mixture for these events, such that 


S; N Poi(R;B;) 
Li ~ Categorical(m.2) 


where 
Ri = Mz; 


is the Poisson mean in the group with index L;. It is common practice to treat rare vital 
events as Poisson in this way, though strictly they are binomial; in more complex models 
adopting Poisson sampling, occasional inadmissible values may be generated for death 
rates R; unless a constraint in the prior is included. An option is to reframe the problem 
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in terms of expected deaths £;, where E; = RB;, and R is here the statewide rate. Then 
S; ~ Poi(v;£;), where the v;’s are relative risks with average close to 1 (and exactly 
equalling 1 if the sum of E; equals the sum of the Sj). 

The smoothed mean for the ith county is drawn from a mixture with two components, 
and will be determined by the average of formula (2.28) over a large number of 
iterations: the mean for county i will be smoothed towards the sub-population mean, 
depending on its highest assignment probabilities. In “disease mapping’ applications 
such as these, a primary object is the identification of counties with genuinely high risk. 
In this regard, simple maximum likelihood estimates (or ‘crude’ rates) 1000S;/B; may 
distort true high risks (e.g. giving spurious impressions of clustering), especially if the 
most extreme rates are based on small birth totals or small death totals. 

For the two group parametric mixture model, G(0.0, 0.01) priors on pj, j= 1, 2 are 
adopted and a Dirichlet D(1, 1) prior on the mixing probabilities 7;. An initial uncon- 
strained run (with a single chain) suggested two well-separated means; a subsequent 
three chain run was therefore undertaken with a constraint on the means. Note that for 
county memberships L; arbitrary initialising may be used, though preferably not all set 
to one subgroup. A run of 10000 iterations identified a minority of areas with high 
SIDS death rates (mortality at around 3.8 per 1000 births, with a summary in Table 2.9 
based on the second half of the run). In terms of the resulting estimates of SIDS rates at 
county level, broadly similar results to Symonds et al. are obtained though with 
different rankings within the high risk counties. The estimates of county rates are 
based on a further 1000 iterations subsequent to the first 10000. 

The shrinkage obtained under this model is exemplified by county 87 (Swain county) 
with a crude SIDS rate of 4.4 per 1000, but based on three deaths among only 675 births 
over five years. Despite its crude rate exceeding the higher subpopulation mean of 3.8 
per 1000, it has a posterior probability of 0.24 of belonging to the low risk group, and a 


Table 2.9 Mixture parameters (two groups) and highest posterior SIDS rates 


Mean St. devn. 2.50% Median 97.50% 


SIDS Rates per 1000 Births 


Anson 3.70 0.52 2.71 3.64 5.02 
Robeson 3.69 0.53 2.70 3.64 5.01 
Halifax 3.69 0.53 2.67 3.64 5.02 
Columbus 3.64 0.61 1.69 3.62 5.00 
Northampton 3.57 0.69 1.62 3.59 5.00 
Rutherford 3.42 0.77 1.62 3.51 4.86 
Rockingham 3.42 0.80 1.58 3.52 4.89 
Bladen 3.26 0.91 1.55 3.43 4.87 
Hertford 3.21 0.93 1.52 3.40 4.89 
Hoke 3.18 0.95 1.52 3.38 4.89 
Washington 3.02 0.99 1.47 3.28 4.81 
Bertie 2.98 1.02 1.46 3.25 4.87 
Means of components 

by 0.0017 0.0001 0.0014 0.0017 0.0019 
W2 0.0038 0.0005 0.0028 0.0037 0.0049 
Probabilities of components 

TI 0.765 0.098 0.538 0.780 0.912 


m2 0.235 0.098 0.088 0.220 0.462 
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smoothed rate of 2.5 per 1000. It may be noted from Table 2.9 that this model provides 
essentially no discrimination between the counties with the three highest rates (counties 
4, 42 and 78): they have a virtual unity probability of belonging to the higher risk group 
and identical median SIDS rates of 3.64. 

One advantage of discrete parametric mixtures is that the number of model param- 
eters is usually well defined, and so penalised fit measures such as the AIC or BIC 
present no problem in definition. Note that this is not the case for random effects 
mixtures, which is why the DIC approximation discussed above has been developed. In 
particular, one might base choice between a model with J 1 components (Model 2) 
against one with J components (Model 1) in terms of that model with minimum AIC or 
BIC. There is a case for using the BIC here since the ‘sample size’ might be defined not 
by the number of counties, but by the number of births in them (Upton, 1991), and also 
because the BIC approximation becomes more reliable for large samples. 

For closely competing models, one might want to assess the probability that the AIC 
or BIC for Model 2 exceeds that for Model 1. If a Poisson mixture with three compon- 
ents is compared against one with two components, there are pı = 3 parameters in the 
simpler model and pz = 5 in the more complex one (three means, and two free mixture 
probabilities). If the models are run in parallel and at each iteration DY 2p2 is 
compared with DP 2pı (where D = —2 log L), then accumulating over iterations 
gives the probability that model j has the minimum AIC (Model B in Program 2.7). 

In this application, the priors on the means for both models are constrained, and an 
upper limit for the highest smoothed mean is set by the maximum crude rate of 9.6 per 
1000. Without this limit there is a tendency for impossible means (mortality exceeding 
100%) to be generated. 

A three chain run of 25 000 iterations (and burn-in of 5000) with initial values for two 
chains provided by a trial run, shows some support for a three group solution. In 
approximate terms, a quarter of counties belong to a low risk sub-group, 50% to a 
medium risk group, and a quarter to a high risk group (Table 2.10). Comparison of the 
AICs at each iteration between the models shows the three group mixture to have a 
lower AIC with a probability of 60%. 

Finally, a Dirichlet process prior approach is applied, with the goal of assessing the 
relative SIDS risks in Anson, Robeson and Halifax counties under a semi-parametric 
approach, and also whether the relative risk in Swain county is clearly above 1 or 


Table 2.10 Comparison of two and three group models 


Mean St. devn. 1% Median 99% 
2 sub-populations AIC 10030 150 9669 10030 10360 
Group Means u 0.0016 0.0001 0.0013 0.0017 0.0019 
bo 0.0037 0.0005 0.0026 0.0036 0.0050 
Group Probabilities 7 0.745 0.105 0.429 0.759 0.923 
T2 0.255 0.105 0.077 0.241 0.571 
3 sub-populations AIC 9968 198 9262 9989 10330 
Group Means u 0.0008 0.0007 0.0000 0.0011 0.0018 
bo 0.0020 0.0005 0.0014 0.0018 0.0036 
u3 0.0040 0.0009 0.0027 0.0039 0.0080 
Group Probabilities 7 0.243 0.244 0.000 0.184 0.818 
T2 0.556 0.221 0.043 0.597 0.896 


T3 0.202 0.104 0.013 0.189 0.507 
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whether alternatively it straddles 1 (and so is not definitively high risk). The reframed 
format describe above is adopted, with expected SIDS deaths defined as E; = RB;, and 
taking S; ~ Poi(v;£;). The relative risks v; = exp(o,) are modelled via a DPP with a 
ceiling of M = 10 clusters, and with an N(0, 1) baseline prior on ;. The prior on r; is as 
in (30) with a = 1 preset. 

A three chain run (with initial values randomly generated from the priors) shows 
convergence of the non-empty cluster total J at around iteration 1000 and the summary 
is based on iterations 1000-5000. This shows the relative risk clearly highest in Anson 
county, and with the Swain county risk straddling 1 (Table 2.11). The distribution of 
non-empty clusters shows six clusters as the most frequent (Figure 2.1). One may wish 
to assess sensitivity of inferences to different baseline densities (e.g. Student ¢ with mean 
0, variance | and low degrees of freedom) or to alternative values of a. 


Example 2.8 Exponential mixtures for patient length of stay distributions The work of 
Harrison and Millard (1991) and McClean and Millard (1993) relates to lengths of stay 
of patients in hospital, with a particular focus on patient lengths of stay in geriatric 
departments. Lengths of stay of other classes of patients have also attracted discrete 
mixture analysis (especially psychiatric patients), as well as other patient characteristics 
such as age at admission (Welham et al., 2000). Lengths of stay in geriatric departments 
exhibit pronounced skewness, but exponential mixture models with relatively few com- 
ponents have been found effective in modelling them. The analysis here shows the utility 
of Bayesian sampling estimation in deriving densities for structural or system parameters. 


Table 2.11 DPP prior on SIDS deaths; selected parameters 


Non-empty clusters Mean St. devn. 1% Median 99% 


J 5.99 1.55 3 6 9 
Relative Risks 

Anson (county 4) 3.07 1.26 1.44 2.68 6.76 
Halifax (county 42) 1.94 0.48 1.03 1.89 3.48 
Robeson (county 78) 1.80 0.35 1.07 1.78 2.65 
Swain (county 87) 1.27 0.54 0.55 1.09 3.01 
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Figure 2.1 Number of DPP clusters in SIDS deaths analysis 
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Thus, let x denote length of stay for a patient in days, and suppose a two component 
exponential mixture, EM(2), is fitted 


f(x) = wWexp(— Bx) (1 —a)exp(— Dx) (2.31) 


where the identifiability constraint is that the average length of stay 1/B in group 1 is 
lower than that in group 2. Harrison and Millard (1991) draw on continuous time 
models for drug flow in pharmacokinetics to derive interpretations for the EM(2) 
parameters. Envisage two classes of patients, a standard stay group, of size A(t) at 
calendar time f, and a long stay group of size L(t) at t (these are distinct from the two 
groups in Equation (2.31)). The long stay group (were it possible to identify them on 
admission to hospital) form an unknown fraction k of patients admitted to the geriatric 
unit, but no one is formally admitted as a long stay patient. Instead, they are notionally 
converted to such a status at some time during their hospital stay. The standard group 
account for the remaining 1 — k share of the patient population. 

Harrison and Millard (1991) assume a steady state system operating at full capacity, 
with A(t) = A and L(t) = L constant and in equilibrium. The stock A in the standard 
group is kept constant by (a) patient discharges from hospital at a rate r, by (b) 
conversion from standard to long stay status at rate v, and by (c) a constant number 
of new patients per day Ao. Finally, long stay patients themselves have a low chance d of 
discharge (including mortality). Under the steady state assumption, it can be shown that 


k=v/wv r-d) 


Under the exponential model for stay lengths, the rate of loss through discharges, 
deaths, and so on, is the inverse of the expected length of stay in days. So the expected 
length of stay for standard patients is 


S =1/v r) 
while that for long stay patients is 
S2 = 1/d (2.32) 
The expected length of stay at admission under the steady state model is therefore 
S=(1-k)Si k62 


Let T be the total number of patients, or equivalently under equilibrium, the number of 
occupied beds in the hospital. Then, from Equation (2.31), the total of patients who 
have been in hospital for at least x days is given by 


t(x) = Turexp(— Bx) T(1 —7)exp(— Dx) 
while from the steady state model assumptions, it can be shown to be 
t(x) = (1 — kA SA — v-r) k Ag So — dy (2.33) 


Under the steady state model, admissions equal exits so that Ap is (Harrison and 
Millard, 1991, Eq. (17)) given by 


(v rTu dT(1-— m) 


Also, equating parameters in Equation (2.33), the proportion k of long stay patients 
may be derived from 


kAoS2 = Td = T) 
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Using Equation (2.32) gives 
k =dT(1—7)/Ao 


So the parameters of the EM(2) model may be used to estimate system parameters, 
resulting, say, from better discharge arrangements to non-hospital care for long stay 
patients. For example, if the daily discharge rate d for such patients is increased by 1/ 
(365L), then (r v)/@ d) more patients per year can be admitted. 

To illustrate the practical working of this approach, and the derivation of system 
parameters from basic model parameters such as those in Equation (2.31), consider 
length of stay observations for T = 469 patients in a psychiatric hospital in North East 
London in January 1991. These are incomplete lengths of stay, but following the work 
discussed above, there is no need to model the censoring mechanism to obtain the 
structural parameters. Following the steady state model of Harrison and Millard as 
discussed above, we consider a two group mixture. One may, however, generalise these 
concepts for, say, J = 3 group mixtures and higher by invoking conversions from a 
short stay group to a standard group at rate vı, from a standard group to long stay at 
rate v2, etc. 

The exponential mixture model (2.31) in Program 2.8A is parameterised so that B 
is larger than D, i.e. so that lengths of stay in group 1 (with prior probability 7) are 
shorter than those in group 2. From a three chain run of 25000 iterations, with 
convergence after a 500 iteration burn-in, a median estimate for m of 0.565 is obtained 
(Table 2.12). This larger group has an exponential parameter of 0.0016 and average 
length of stay of 620 days, compared to over 7500 days in the smaller group. To 
informally assess model fit, one may monitor the ‘complete data’ likelihood, taking 
the latent group membership indices (L[i] in the coding in Program 2.8A) at each 
iteration as known. (Note that Example 2.7 used the ‘incomplete data’ likelihood.) 
This likelihood has a posterior median of —3999. So there is a clear gain in fit over a 
single group exponential model, which has median likelihood —4324 and estimates an 
average length of stay of 3700 days. 

The proportion k of long stay patients under the steady state model is just under 6%, 
and the median number of such patients is then L = 27. Increasing the discharge rate 


Table 2.12 Structural length of stay parameters: steady state patient flow model 


Parameter Mean St. devn. 2.5% 97.5% 

Ag Admission Rate (New patients per day) 0.457 0.041 0.383 0.547 
Complete Data Log-likelihood —3999 11.5 —4023 —3977 
Expected LOS, Mixture Group 1 620.9 74.1 478.2 769.7 
Expected LOS, Mixture Group 2 7810 720 6543 9336 

L Number of long stay patients 27.23 4.41 19.58 36.94 

S Expected LOS on admission, all patients 1034 92.5 857.4 1225 
Benefit from higher discharge rate of 7.596 0.813 6.105 9.298 
long stay patients 

a Mixture Probability 0.565 0.040 0.487 0.642 

K Proportion long stay 0.058 0.009 0.042 0.079 

B Exponential Mean, Group 1 0.0016 0.0002 0.0013 0.0021 

D Exponential Mean, Group 2 0.00013 0.000012 0.00011 0.00015 

R Standard group discharge rate 0.0015 0.00019 0.0012 0.0020 

v Conversion rate, standard to long stay 0.000088 0.000019 0.000056 0.00013 
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among long stay patients by 1/(365L), or by one patient a year, means that on averaged 
7.6 more patients per year can be admitted; this is the posterior mean of (r__v)/(v_d). 
The uncertainty attaching to some of the structural parameters, such as the length of 
stay in group 2 (95% interval from 6500 to 9300 days) feeds through to the structural 
parameters. Thus, the number of new patients per day, Ao, is estimated with 95% 
interval {0.38, 0.55}, and the number of long stay patients has a 95% interval between 
20 and 37. 

We consider next mixtures with more groups (though without estimating the struc- 
tural parameters). Thus for J = 3, and running three chains for 25 000 iterations yields 
convergence after 2500 iterations. The two extra parameters produce a further gain in fit 
(Model C in Program 2.8A). The median complete data likelihood is —3916, and a short 
stay group with average LOS around 54 days, exponential mean 0.0203, and relative 
probability of 12%, is identified. 

A DPP version of this mixture analysis for the patient lengths of stay (Program 2.8B) 
is also applied. A maximum possible M = 5 alternative exponential means is set, i.e. up 
to five potential clusters in which the 469 lengths of stay can be arranged. Trial analysis 
suggested that most of the mass was concentrated on 3-5 clusters, even for larger a (the 
average number of non-empty clusters J is expected to be increase with the Dirichlet 
precision parameter). As well as monitoring the number of non-empty clusters and the 
complete data likelihood, one may track the largest cluster mean (th.R[M] or th.R[5] 
in the code), corresponding to the shortest stay patients. Monitoring convergence 
is complicated by the fact that parameters may have different meanings according 
to whether the number of non-empty clusters at a particular iteration is 
J=3,J/=4, J = 5, etc. 

A three chain run with a = 1 is run for 5000 iterations, with the second half giving an 
average for non-empty clusters J of 3.1, and complete likelihood averaging —3918. The 
two cluster solution was not selected in any iteration. The exponential parameter 
corresponding to the short stay group has an average value 0.37 when five clusters are 
chosen (in 27 of 7500 iterations), 0.085 when four clusters are selected (which occurs in 
about 12.5% of iterations) and an average value 0.0213 at iterations when three clusters 
are chosen. To assess this patterning, it is necessary to use the CODA facility in BUGS 
and sort iterations according to the number of clusters selected. Increasing the Dirichlet 
parameter (e.g. to a = 5) does raise the proportion of iterations when four or five 
clusters are selected, but there is no overwhelming evidence to suggest the existence of 
four or more clusters in the data. 


2.5 GENERAL ADDITIVE AND HISTOGRAM SMOOTHING PRIORS 


Many types of smoothing problems involve a series of observations y1, ..., Yn at 
equally spaced observation or design points (e.g. consecutive ages, income groups, 
etc.). While often considered as time series methods, it is worth emphasising the scope 
for more general applications. Among the approaches which have figured in the 
Bayesian literature, we consider here two relatively simple methodologies which are 
suitable for a wide range of problems. One is based on smoothness priors derived from 
differencing an underlying series of true values (Kitagawa and Gersch, 1996). The other, 
histogram smoothing, is applicable to frequency plots, including those deriving from 
originally metric data aggregated to equal length intervals — an example involving 
weight gains in pigs is considered by Leonard and Hsu (1999). 
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2.5.1 Smoothness priors 


For the smoothness prior approach, it is assumed initially that the y; are metric with 
normal errors. Extensions to discrete outcomes and unequally spaced observation 
intervals are also considered below, as are extensions to grouped observations — multiple 
observations at a single design point t. 

The model for metric observations is then y; = f, e,, where e, ~ N(0, 0”), and where 
the fps may be taken as an underlying series, free of the measurement error in the 
observed series. The measurement error assumption leading to an underlying series 
which is the object of smoothing is especially frequent for continuous data. For discrete 
outcomes, the focus of smoothing is typically on the mean count, risk or probability. 
Thus, if the y, were Poisson, then one might have 


Yı ~ Poi(y,) 
log (m) =fr=fi-1 ur 
while if they were binomial with populations n,, then the model might be 
Yı ~ Bin(m,, n) 
logit(t,) = fı = f1 u 


A widely used smoothing model assumes Normal or Student distributed random 
walks in the first, second or higher differences of the fps. For example, a Normal 
Random Walk (RW) in the first difference Af; is equivalent to the smoothness prior 


f~ NG- 7) (2.34) 
and a random walk in the second difference A’/, leads to the prior 
fi~ NOF- —fi-a T) (2.35) 
These two priors can be expressed in the alternative forms 
fi=fi-1 u 


and 


Si = fiif Ut 


respectively. These models have to be set in motion by assumptions about the 
initial parameter(s): this involves separate priors on the initial parameters, such as fi 
in the case of a first difference RW, fı and f in the case of a second order RW, and so 
on. A common practice is to ascribe vague priors with large variances to these param- 
eters. 

A rationale for this is provided by the measurement error model for continuous 
Normal data. Here the ratio \ = o? /7? is a smoothing parameter, which for a second 
order random walk, RW(2), appears in the penalised least squares criterion 


Da e Soe t 
t=3 


With flat initial priors on the initial fps in the series the posterior modes for f; are 
equivalently those that minimise this criterion. Thus, one might set initialising priors 


r~NOKD, j=1,..,r 
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where r is the order of differencing in the smoothness prior and k is large. 

It may be noted that one is not confined to asymmetric priors like (2.34) and (2.35), 
which make most sense if t does denote time; if the index t were something like income 
group, then it might make sense for f; to depend upon adjacent points at either side. 
Thus, a generalisation of the first order prior might involve dependence on both /;_; and 
Ji 1. For example, 


FSS Sha u 2<t<n-1 
Ai=h u 
Sn =Sn-1 Un 


Further issues concern specialised priors in the event of unequal spacing or tied 
observations. For unequal spacing, it is necessary to weight each preceding point 
differently, and to change the precision such that wider spaced points are less tied to 
their predecessor than closer spaced points. Thus, suppose the observations were at 
points ti, fo, ..., tf with ò = b — ti, ò = b3 — b, ..., Òn-1 = fn — tn-1. The smooth- 
ness prior (2.34) would become 


ti i N( fi-1; 8,7’) 


so that wider gaps 6, translates into larger variances. Similarly, the prior (2.35) would 
become 


fi ee NO, òT) 


where v; = fi-1(1 8, /8;-1) — fr_-2(8,/8)-1). vı reduces to the standard form if 8, = 8)-1. 
For multiple observations at one point in the series, we have one model for the n 
observations and a prior for the G distinct values of the series index. For instance, 
suppose the observations y; were 1, 3, 4, 7, 11, 15 (n = 6) at observation points 1, 1, 2, 2, 
3, 4 (i.e. G = 4). Then if g, denotes the grouping index to which observation t belongs, 
and a first order random walk in the underlying series is assumed, we would have 


ye ~ N( fled, o°), t=T h 
fe~N fn 7), k=2,..,G 


2.5.2 Histogram smoothing 


Suppose observations of an originally continuous variable y are grouped into frequency 
counts, with f; being the number of observations between cut points Kj, and Kj. 
For example, Leonard and Hsu (1999) discuss the pig weight gain data of Snedecor 
and Cochran (1989), where weight gains of 522 pigs are aggregated into 21 intervals of 
equal width. One may also consider counts f; of events (e.g. migrations, deaths) 
arranged in terms of a monotonic classifier, at levels j = 1, .., J, such as age band 
or income level. Often, sampling variability will mean the observed frequencies f; in the 
jth interval will be ragged in form when substantive considerations would imply 
smoothness. 

Let prior beliefs about such smoothness be represented by an underlying density A( y), 
and let m; be the probability that an observation is in the jth interval 


K; 
Tj = | h(u)du 


Kj- 
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The observed frequencies are assumed to be multinomial with parameters 
{a1, T2, .., TJ}, and to imperfectly reflect the underlying smooth density. A prior 
structure on the m;’s that includes smoothness considerations is appropriate. Specific- 
ally, express 7; via a multiple logit model 


mj = exp(,)/ XC exp (x) 
k 


where the b, are taken to be multivariate normal with means a1, a, .., ay and J x J 
covariance matrix V. A noninformative prior on the 7;’s would take them to be equal in 
size, 1.e. 


Tj = 1/J 
which is equivalent to 
aj = — log (J) 


In the case of frequencies arranged by age, income, etc., there are likely to be total 
population or exposure totals £;, and the appropriate prior on the m;s then would 
involve taking 


m= E;/ X Ex (2.36a) 
k 


or 


aj = log (E;/ X. Ex) (2.36b) 
k 


The covariance matrix V is structured to reflect dependence between frequencies neigh- 
bouring on the histogram classifier. Thus, one option is a first order dependence of 
order 1 with correlation p, as discussed in Chapter 5. Thus 


Vj = plo? (2.37) 


Then p = 0 leads to exchangeability in the histogram probabilities, that is to a joint 
distribution that is unaffected by permutation of the suffixes 1, 2,.., J. This is contrary 
to expectation in many situations where greater similarity is anticipated between m; and 
m; ı than between m; and m; 2. Note that defining k = 1/(o7 — o’p), the elements of 
T = V`! in the case of the prior (2.37) are given by 


Ty, = Ty) =k, 
Tj=K( p j=2,..,J—1 
Tyi=Ā}ig=-kp Dead 


Tj = 0 elsewhere 


Example 2.10 Hemorrhagic Conjunctivitis As an example of the smoothness prior 
approach, though from a maximum likelihood perspective, Kashigawi and Yanagimoto 
(1992) consider smoothing of serial count data through a state-space model, specifically 
weekly data y; on new cases of acute hemorrhagic conjunctivitis in the Chiba-prefecture 
of Japan in 1987. 
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Thus, if y; ~ Poi(a,), t = 1,.. n, with the data sequenced in time, they assume that the 
\,’s change gradually through time. They therefore impose a difference prior of order r 
on the log link v; = log ),, with 


A’y; ~ NO, 0”) (2.38) 


The case r = 1 with o? = 0 leads to a trendless model with constant incidence of new 
cases, `} = Ay =... = Ay. This leads on to a test for homogeneity of the means through 
time (with r = 1 in Equation (2.38)), namely o? = 0 vs. ø? > 0. While these data are 
time series, the approach is applicable to any series arranged against a monotonic index 
(e.g. age, income, weight). Kashigawi and Yanagimoto detected a significant departure 
from a stable mean weekly incidence in these series with an increased level apparent 
between about the 28th and 38th weeks. 

Here a flat N(0, 1000) prior on fı is assumed with an RW(1) smoothness prior, 
and initially a G(1, 0.001) prior on 1/o7. The findings of Kashigawi and Yanagimoto 
are replicated, and Figure 2.2 shows the underlying series of mean incidence. The 
mean number of new cases peaks at around 1.21 in the weeks 35-38, then falls back 
to around 0.7 at the end of the year. This smooth gives an appearance of over- 
smoothing, and Figure 1 of Kashigawi and Yanagimoto in fact shows a smooth 
where there is a more pronounced variation in mean incidence than in Figure 2.2. 
Their smoothed series can be replicated more closely with a G(1, 1) prior on o~’, 
since this places lower weight on high precisions (lower variances o? which produce 
less smoothing). The means vary now from 0.25-—2.4. To resolve the problem of the most 
appropriate smoothing variance, several alternative priors might be adopted and their 
appropriateness compared with simple fit measures (e.g. log-likelihoods) or formal 
Bayes choice methods. With the G(1, 1) prior on the precision, the log-likelihood of 
a trend model (with non-zero variance o?) averages —56, as against —69 for a constant 
mean model. 

To estimate the effective number of parameters in the model with a G(1, 1) prior on 
the precision, the DIC procedure suggests six effective parameters based on comparing 
the mean deviance of 33.6 with the deviance at the posterior mean of 27.8. Note that 
deriving this (in a non-iterative program) involves the posterior estimates of \; as these 


week 17 
week 21 
week 25 
week 29 
week 33 
week 37 
week 41 
week 45 
week 49 


—— G(1,0.001) Prior —=— G(1,1) Prior 


Figure 2.2 Smoothed plots of mean conjunctivitis incidence 
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alone define the likelihood. On this basis, the reduction in average deviance of around 
26 is offset by five extra parameters. 

Another option is to set a higher stage prior on the gamma parameters a, bı in 
1/o* ~ G(a, bı). For example, the choice 


a, ~ exp(1) and 
bı ~ G(0.001, 0.001) 1(0.001, ) 


is satisfactory in terms of identifiability, and yields a smooth series intermediate between 
the two in Figure 2.2. The DIC is worse than under the preset G(1, 1) prior on 1/o7. 
Fidelity to the original data is not the only criterion governing choice of a smoothing 
model, and genuine cross-validation (involving omission of cases and prediction by the 
remaining data) might also be investigated. 


Example 2.11 Mental illness hospitalisations As an example comparing histogram 
and state space smoothing, consider hospitalisations f; for mental illness at age j 
over two London boroughs in the year Ist April 1998 to 31st March 1999. These 
are classified by single year of age from ages 15-84, and may be seen as binomial 
outcomes in relation to age-specific population estimates P; at mid-1999. 


Fitting a histogram smoothing model entails setting (see Equation (2.36)) 
aj = log (P;/ XC Px) 
k 


where P; is the population at risk at age j. A U(0, 1) prior on the correlation parameter p 
is assumed, and a gamma G(1, 0.001) prior on 1/07; this defines the covariance matrix V 
as in Section 2.5.2. A three chain run shows convergence of Gelman—Rubin criteria for 
these two parameters at around iteration 1500, and the remaining iterations of a run of 
5000 provide the summary. 

Table 2.13 shows the smoothing of ragged frequencies to reflect the operation of 
the prior in adjacent categories of the histogram. Thus, the observed schedule shows 
19 hospital cases at age 21, six at age 22 and 25 at age 23. The smoothed version 
shows much less discontinuity with 16.5, 13.6, and 19 as the estimated frequencies 
of hospitalisation at these ages. Taking account of exposures to risk as in the 
prior parameterisation (2.36b) in fact makes little difference to the degree of 
smoothing obtained in this example. The posterior for p is concentrated above 0.9. 
A posterior mean estimate of p = 0.97 is obtained with 95% interval from 0.92 to 
0.995. 

A first order random walk state space model with binomial sampling at the first stage 
is then applied, with a G(1, 0.001) prior on the smoothing precision and an N(0, 1000) 
prior on the initial value f1. From a three chain run (5000 iterations with 500 burn-in) a 
posterior precision averaging 22 is obtained, with fı averaging —6.4. This compares to 
the logit of the first observed rate, logit(5/4236) = —6.74, so the first underlying 
probability has been smoothed towards the average. Values for remaining smoothed 
frequencies are close to those obtained under the histogram smoothing approach. The 
random walk and histogram smooth model are expected to be broadly similar for values 
of p in Equation (2.37) close to 1. 
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Table 2.13 Smooth histogram for mental illness hospital cases 


Age Popn. Actual Obs. Histogram smoothing Random Walk 
Hospitalisations Rate x 
1000 Mean 2.5% 97.5% Mean 2.5% 97.5% 


15 4236 5 1.2 7.4 4.3 11.5 7.1 4.2 10.9 
16 4293 7 1.6 8.2 5.3 11.9 7.9 5.1 11.6 
17 4378 8 1.8 9.6 64 13.5 9.3 6.2 13.1 
18 4760 13 25) 12.9 8.9 17.6 12:5 8.7 17.0 
19 4940 17 3.4 16.1 116 21.9 15.8 11.2 21.4 
20 4907 21 4.3 18.2 13.1 24.5 17.9 13.0 24.2 
21 4603 19 4.1 16.6 12.0 22.4 16.4 11.8 21.8 
22 4596 6 1.3 13.9 9.7 18.8 14.0 9.9 18.8 
23 4861 25 5.1 19.0 13.6 25.0 18.9 13.7 24.9 
24 5116 11 22 18.3 13.0 24.1 18.4 13.3 24.1 
25 5375 32 6.0 26.2 19.5 34.2 25.9 19.4 33.5 
26 5709 29 5.1 27.5 20.9 35.7 27.5 21.0 35.5 
27 5932 24 4.0 26.3 19.6 34.1 26.7 20.1 34.2 
28 5942 27 4.5 28.3 21.3 36.3 28.5 21.6 36.4 
29 5705 30 5.3 31.0 23.7 39.5 31.0 23.8 39.4 
30 5678 42 7.4 37.1 29.3 46.5 36.8 28.5 46.8 
31 5583 34 6.1 33.6 261 429 33.7 26.0 42.7 
32 5617 31 5.5 31.1 240 39.4 31.3 23.9 39.8 
33 5569 32 XT 29.0 22.0 37.4 28.8 21.8 37.1 
34 5610 19 3.4 23.4 17.5 30.6 23.2 17.1 29.9 
35 5932 25 4.2 24.0 17.9 31.2 24.0 17.7 31.0 
36 5758 19 3.3 21.6 15.7 28.3 21.9 16.1 28.5 
37 5600 24 4.3 22.7 16.8 29.6 23.0 17.0 30.1 
38 5414 23 4.2 22.6 17.0 29.6 22.7 16.8 29.9 
39 5176 22 4.3 21.6 16.1 28.1 22.0 16.3 28.5 
40 5099 21 4.1 21.2 15.8 28.4 21.6 16.1 28.0 
41 4950 23 4.6 21.2 15.7 27.8 21.5 15.9 28.0 
42 4782 14 2.9 190 13.9 25.4 19.4 14.2 25.6 
43 4646 25 5.4 22.7 16.9 30.2 22.8 16.9 29.6 
44 4491 26 5.8 24.0 17.7 31.6 23.9 17.8 31.2 
45 4626 24 5.2 24.2 17.8 31.3 24.1 18.2 31.3 
46 4591 27 5.9 23.7 17.4 31.0 23.6 17.6 30.9 
47 4502 16 3.6 19.1 14.0 25.0 19.3 14.0 25.3 
48 4565 20 4.4 19.1 13.9 25.3 19.1 13.9 25.3 
49 4689 19 4.1 184 13.0 24.5 18.5 13.4 24.5 
50 4856 18 3.7 17.5 12.5 23.8 17.5 12.6 23.3 
51 5188 15 2:9 16.5 11.7 224 16.6 11.8 22.2 
52 5662 15 2.6 17.2 124 23.1 17.4 12.6 23.3 
53 4426 20 4.5 146 106 19.9 14.7 10.7 19.9 
54 4339 4 0.9 11.6 7.9 15.7 11.8 8.1 16.1 
55 4329 16 3.7 14.3 10.2 19.5 14.3 10.2 19.5 
56 4098 19 4.6 15.4 11.1 21.0 15.2 10.9 20.6 
57 3711 13 3.5 13.0 90 179 13.0 9.1 17.7 


(continues) 
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Table 2.13 (continued) 


Age Popn. Actual Obs. Histogram smoothing Random Walk 


Hospitalisations Rate x 
1000 Mean 2.5% 97.5% Mean 2.5% 97.5% 


58 3411 9 2.6 11.3 7.7 15.5 11.3 7.9 15.7 
59 3605 15 4.2 12.7 8.6 17.5 12.5 8.8 17.2 
60 3731 7 1.9 12.0 8.2 16.4 12.1 8.3 16.6 
61 3680 19 5.2 14.4 10.1 19.7 14.2 10.1 19.4 
62 3580 12 3.4 13.0 9.1 18.3 13.0 9.1 17.8 
63 3480 9 2.6 12.3 8.6 16.6 12.4 8.5 16.9 
64 3355 13 3.9 13.8 9.8 18.5 13.7 9.7 18.6 
65 3330 23 6.9 16.3 11.6 22.1 16.1 11.5 21.9 
66 3259 10 3.1 13.4 94 18.1 13.3 9.5 18.0 
67 3291 11 3.3 13.5 9.5 18.2 13.5 9.6 18.3 
68 3277 16 49 15.3 11.0 20.9 15.3 10.9 20.7 
69 3209 12 3.7 16.1 11.3 21.7 16.4 11.7 21.7 
70 3144 24 7.6 21.7 160 28.4 21.5 16.0 28.3 
71 3033 25 8.2 25.0 18.7 32.2 24.7 18.6 32.2 
72 2985 30 10.1 29.0 22.3 36.9 28.7 21.9 36.9 
73 2929 24 8.2 30.4 23.1 38.4 30.8 23.4 39.1 
74 2831 47 16.6 45.1 35.6 56.7 45.2 35.8 56.1 
75 2753 71 25.8 61.9 50.0 75.4 61.2 49.1 74.8 
76 2688 49 18.2 53.2 42.8 65.7 53.4 42.6 65.2 
77 2718 59 21.7 59.4 48.1 71.7 59.5 48.2 72.2 
78 2738 69 25.2 67.5 56.1 80.9 67.5 55.2 81.5 
79 2710 63 23.2 69.8 56.8 83.5 69.9 56.7 84.6 
80 1626 64 39.4 61.1 49.7 74.9 60.9 49.5 73.9 
81 1446 69 47.7 68.7 564 82.4 68.3 56.0 82.3 
82 1502 90 59.9 86.7 72.2 101.9 86.8 72.2 102.5 
83 1489 86 57.8 89.5 74.5 106.4 90.2 75.7 106.2 
84 1444 115 79.6 109.2 91.5 128.5 110.0 92.7 129.5 
2.6 REVIEW 


The above discussion and examples have emphasised parametric smoothing methods 
based on exchangeable sample members. The fully Bayes approach to combining infor- 
mation over exchangeable units using exponential family densities is exemplified by 
George et al. (1994), and stresses the benefits (e.g. in fully expressing uncertainty) as 
compared to parametric empirical Bayes smoothing (see, for example, Morris 1983). 
The fully Bayes method implemented through repeated sampling allows the derivation 
of complex inferences concerning the relationships among the units, such as the density 
of the maximum or the density of the rank attached to each sample unit (Marshall and 
Spiegelhalter, 1998, p. 237). 

If the simplest model based on exchangeable units assumes smoothing to a common 
global mean and a unique variance readily made modifications may be more realistic, 
for example allowing asymmetric skewed densities (Branco and Dey, 2001) or allowing 
the data to be distributed as a mixture of two or three distributions with different means 
and/or variances. Adopting discrete mixtures of parametric densities leads into semi and 
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non-parametric Bayesian methods. Further flexibility is provided by the range of 
approaches based on the Dirichlet process priors (illustrated in Examples 2.7 and 2.8), 
and discussed by authors such as Dey et al. (1999) and Walker et al. (1999). These are 
more natural approaches if clustering of sub-groups within the sample is expected, or as 
providing a sensitivity analysis against baseline unimodal smoothing model. Sometimes 
the latter may suffice: for example, Marshall and Spiegelhalters analysis of 33 transplant 
centres failed to confirm a two cluster division of the centres. More specialised depart- 
ures from exchangeability occur in the analysis of spatially correlated data where the 
clustering is based on spatial contiguity (Chapter 7). 

Whether a unimodal symmetric density is appropriate or not as a basis for combining 
information, a further major element to the process of joint inferences about sample 
units is the presence of further relevant information, possibly over different levels of 
data hierarchies (pupils, schools, etc.). Hence inferences about means or ranks for 
sample units may need to take account of covariates: for instance, severity or casemix 
indices may be relevant to rankings of medical institutions (see Example 4.6). The next 
two chapters accordingly consider the modelling of covariate effects in single and multi- 
level data. 
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EXERCISES 


. Using the data in Example 2.1, consider comparing model fit (e.g. via the DIC 


approach) between the fixed effects and gamma-Poisson mixture models. The fixed 
effects model allows the underlying relative risks à; to be different, but does not 
relate them to an overall hyperdensity. Identify the largest rate under each approach 
and the probability that it exceeds the average rate (by using the sample to assess this 
probability). 


. Also in Example 2.1, again try to analyse via random effects, but using Normal and 


Student t mixtures applied in the log scale for \;. How far does the robust Student t 
alternative (with degrees of freedom an unknown parameter) make a difference to 
the smoothed relative risks? 


. In Example 2.2, apply the DIC procedure to discriminate between binomial and 


beta-binomial models. 


. Repeat the beta-geometric mixture analysis of Example 2.3 for couples with non- 


smoking female partners. Calculate the chi square statistic for comparing actual and 
predicted cycles to conception counts (as in Table 2.4). Also, consider how to use this 
statistic in a predictive check fashion (see Equation (2.15)). 


. In Example 2.8, try the DPP analysis with a Dirichlet precision parameter a of 5. 


How does this compare with the results when taking a = 1, and what are the 
implications for the number of sub-groups apparent in the data. Also, try to identify 
a four group mixture by ‘conventional’ discrete mixture methods and consider how 
identifiability is compromised. 


. In Example 2.11, program the sampling of replicate frequencies Z; for ages 15-84, 


and so compare the predictive criterion G* in Equation (2.14) between the two 
models. 
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CHAPTER 3 


Regression Models 


3.1 INTRODUCTION: BAYESIAN REGRESSION 


Methods for Bayesian estimation of the Normal linear regression model, whether 
with univariate or multivariate outcome, are well established. Assuming the predictors 
are exogenous and measured without error (i.e. not random), they may be conditioned 
on as fixed constants. For a univariate outcome the parameters are then the regression 
coefficients, linking the mean outcome for case i to predictors xj, Xi2,..Xip for 
that case, and the conditional or residual variance. With an inverse gamma prior on 
the variance, and conjugate Normal prior on the regression coefficients, conditional 
on the sampled variance, analytic formulae for the posterior densities of these coeffi- 
cients and other relevant quantities (e.g. predictions for new explanatory variable 
values) are available. These permit direct estimation with no need for repeated sam- 
pling. 

However, generalisations which include discrete outcomes, non-linear or varying 
coefficient relationships, non-conjugate priors (Carlin and Polson, 1991), or priors 
constraining coefficients to substantively expected ranges, may all be facilitated by a 
sampling-based approach to estimation. Similar advantages apply in assessing the 
density of structural quantities defined by functions of parameters and data. The 
Bayesian approach may also be used to benefit with regression model selection, in 
terms of priors adapted to screening out marginally important predictors, or in com- 
parisons between non-nested models (see Example 3.3). Recent methods tackle some of 
the identification problems (such as label-switching) in discrete mixture regression 
(Frihwirth-Schnatter, 2001). Bayesian methods have also been proposed (see the 
book by Dey et al., 1999) for discrete outcome data which are over-dispersed in relation 
to standard densities such as the Poisson. 

The development below is selective among the wide range of modelling issues which 
have been explored from a Bayes perspective, but intended to illustrate some potential 
benefits of the Bayes approach. The first case studies of Bayesian regression techniques 
include questions of predictor and model choice, for instance comparing marginal 
likelihood approximations with approaches with methods based on external validation 
(Section 3.2). Outcomes are binary, counts or univariate continuous. The focus is 
then on models for multiple category outcomes, which includes ordinal data, and 
discrete mixture regressions (Sections 3.3 and 3.4). These models illustrate the central 
role of prior parameter specification to reflect the form of the data. As one possible 
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methodology for nonlinear regression, Section 3.5 discusses random walk and state 
space priors. A final theme (Section 3.6) is possible approaches to robust estimation in 
the event of departures from standard densities, and the related problem of outlier or 
influential observations. First, though, we consider questions around the specification 
of priors on regression model parameters. Among the aspects to consider are that priors 
express accumulated knowledge or subject matter constraints, may be defined by the 
form of outcome (e.g. in ordinal regression), and may be central to identification (e.g. in 
discrete mixture regressions). 


3.1.1 Specifying priors: constraints on parameters 


Thus, prior constraints on parameters may be defined by subject matter as in certain 
econometric models. As a subject matter example, Griffith et al. (1993) discuss the 
aggregate consumption function in economics (with C=consumption, and 
Y=disposable income), 


C=B8; BY 


where B; is autonomous consumption (consumption when there is no disposable 
income), and B, is the marginal propensity to consume. Economic principles suggest 
that of every dollar earned as disposable, some will be consumed (spent) and some will 
be saved. Hence, B, should lie between 0 and 1. Similarly, baseline consumption is 
expected to be positive, so that B; > 0. 

Constraints on parameters can usually be dealt with in the prior for those parameters. 
Formally, let Q be the unconstrained parameter space of B and let R be a constraint, 
expressed as 


R:B € Or 


where Or is a subspace of Q. The posterior density of B given the constraint, outcomes y 
and regressors X is then, assuming P(R|y) 4 0, 


P(B|R, y, X) = PBly, X)P(RIB, y)/P(RIy) 


This is the posterior density that would have been obtained in the absence of a 
constraint, namely P(B|y, X), multiplied by a factor proportional to the conditional 
probability of the constraint given B. Such constraints on parameters imply that 
MCMC parameter samples BP at iteration ¢ may be generated from non-truncated 
densities, but if they violate the constraint they are rejected with probability 1. 

This principle also finds expression in applications where the constraint reflects 
the form of the outcome. Thus for ordinal outcomes, an underlying continuous scale 
may be postulated on which cutpoints — which are constrained to be increasing — 
correspond to increasingly ranked categories. Specifically, the cumulative odds model 
(McCullagh, 1980) is framed in terms of the cumulative probability y; of subject i being 
located in the first j categories of a ranked outcome. Then logit(y;) — or maybe some 
other link function of y; — is related to a series of cutpoint parameters 0; on 
the underlying scale. The utility of Bayesian approaches to ordinal outcomes is illus- 
trated below in terms of applications to diagnostic tests (Example 3.9). 
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3.1.2 Prior specification: adopting robust or informative priors 


Specification of the prior may be less clearly defined in certain applications by the subject 
matter or outcome form than in Section 3.1.1, but is still often central to model identifia- 
bility, fidelity to the observed data, and robustness in inference. Often, the structure of 
residuals suggests alternatives to standard density assumptions, in terms of heavier tails 
than under the Normal, as well as outlier and influential observations. To model heavy 
tails Dickey (1976), Chib et al. (1990) and Geweke (1993) are among those adopting 
univariate or multivariate Student ¢ densities as sampling densities for the data, or as 
priors for linear regression parameters. A contaminated Normal or Student £ model, 
in which outliers have shifted location and or variances, provides one Bayesian approach 
to outliers (e.g. Verdinelli and Wasserman, 1991). For instance, assuming a small 
probability such as ¢ = 0.05 that an outlier occurs, the density of a metric y; may take 
the form 


Sile, 0, 8) =(1—e)b(yi\p, o) eb(yil Di, 07) 


where D; is the shift in location for case 7 if it is selected as an outlier. One may adopt 
similar methods for shifts in regression coefficients. 

Other options to protect against the influence of outliers or influential cases may be 
particular to the form of outcome. With a binary outcome, for instance, a prior may 
be set ona ‘transposition’, namely when y; = 1 is the actual observation but the regression 
model provides much higher support for Pr(y; = 0) than for Pr(y; = 1). In terms of 
the contaminated data approach of Copas (1988, p. 243) this would be a measure of ‘the 
evidence that each individual observation has been misrecorded’ (see Example 3.14). 
Discrete mixture regressions also intend to improve robustness (to assumptions of a 
single homogenous density across all data points), as well as providing for substantively 
based differences in the regression coefficients between sub-populations (Wedel et al., 
1993). 

Model identifiability, and credibility in terms of existing knowledge, may be improved 
by adopting an informative prior for one or more regression coefficients. Such a 
prior might refer to results from similar studies. For instance, if the estimate of a 
regression parameter B from previous study had mean 8, and covariance Yo, we might 
inflate So by a factor a > 1 (e.g. a = 5) to provide a prior covariance on B in the current 
study (Birkes and Dodge, 1993, p. 156; Dellaportas and Smith, 1993, p. 444). More 
formal procedures for reflecting historical data may actually include such data in 
the likelihood, though down-weighted relative to the current data (Ibrahim and Chen, 
2000). 

One might also consider some summary statistic from the observed data to inform 
the specification of the prior, though with suitable down-weighting, so that priors 
become ‘gently data determined’ (Browne and Draper, 2000). Note, however, that 
strictly this is departing from fully Bayes principles. Sometimes it is preferable to 
avoid completely flat priors, especially if flat priors lead to effectively improper posterior 
densities or poor identifiability. Thus, Fernandez and Steel (1999) show how choosing a 
proper Wishart prior for the conditional precision matrix $! improves identifiability of 
multivariate Student ¢ errors regression. In mixture regressions (Section 3.5), some 
degree of prior information may be essential to identifiability. 
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3.1.3 Regression models for overdispersed discrete outcomes 


Below we consider regression applications both for metric and discrete outcomes. 
General Linear Models (GLMs) have been proposed as a unified structure for both 
types of outcomes. As elaborated by McCullagh and Nelder (1989, p. 28) several 
discrete densities (together with the Normal and Student ¢ densities) can be encom- 
passed within the exponential family. The exponential family density (Gelfand and 
Ghosh, 2000) has the form 


f(yi|8) = exp {o7 '[y:0: — DOD] elyi PH} (3.1) 


where g; are scale parameters, and the means are obtained as 


E( y) = p; = b'(0;) 
and variances as 


V(yi) = p;b"(8;) = @; V (m) 


Poisson and binomial densities have a fixed scale factors »; = ẹ = 1. So for the Poisson, 
with b(0;) = exp (0;), the mean and variance are both p,. 

Often, though, as also noted in Chapter 2, residual variability under these densities 
exceeds that expected under the postulated variance-mean relationship. The 
variance function remains as above, but now ọ > 1 for all subjects, or possibly disper- 
sion factors vary by subject with ọ; > 1, and are themselves modelled as functions of 
covariates. 

One approach to handling such heterogeneity in general linear models involves 
random effects modelling in the model for the mean response. Thus for a Poisson 
outcome, y; ~ Poi(w;) we might stipulate a model for the mean p; containing both 
fixed and random effects: 


log (m;) = BX; &; 


where the g; are parametric (e.g. Normal) or possibly semi-parametric (e.g. where a 
DPP prior for the g; is used with a Normal baseline density). This effectively means 
adding a set of parameters which increase in number with the sample size, technically 
making the likelihood nonregular and raising the question about how many parameters 
are actually in the model (see Chapter 2 on assessing the number of effective param- 
eters). 

Alternative approaches to overdispersion include reparameterisations of the variance 
function, and generalised density and likelihood forms such as the double exponential 
that model both mean and variance functions (Carroll and Ruppert, 1988; Efron, 1986). 
Nelder and Pregibon (1987) propose a ‘quasi-likelihood’, which in logged form for one 
observation is 


—0.5D( yi, wi)/9; — 0.5 log [27; Vu] 
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where D(,) is the deviance function’. These are quasi-likelihoods because they do not 
correspond to any complete likelihood? within the exponential family (Albert and 
Pepple, 1989). West (1985) has, however, proposed a Bayesian approach involving a 
scaled, and hence complete, likelihood 


fries 6) =y expl- ¥,DO% mw) (3.2) 


where y; =A/(A__4,), and A reflects the level of over-dispersion. As à tends to infinity, 
and y; approaches 1, the scaled likelihood approaches the baseline exponential density. 
Various parameterisations of the variance function are possible: for instance, Engel 
(1992) has suggested 


Vy) = PV (u) = piu; (3.3) 
where 
log (9;) = NW; 


and W; contains covariates relevant to explaining extra-variability. 
Breslow (1984) and others have proposed a related approach where a pseudo-likelihood 
for each observation has the logged form 


—0.5( 7 — w)? /lp; Vu) — 0.5 log 27@; Vu] (3.4) 


The pseudo-likelihood approach is thus equivalent to assuming a Normal outcome but 
with modified variance function. Ganio and Schafer (1992) and Dey et al. (1997) are 
among those adopting these forms or extensions of them, and relating the variance 
inflators ; to predictors; Nelder and Lee (1991) term such models joint GLMs. 

Another methodology in instances of over-dispersion involves generalising the stand- 
ard variance function for particular densities. Thus, for a count regression with mean 
u; = exp (Bx;), Winkelmann and Zimmerman (1991) propose a variance function 


Vn) =(@- Dp pw, (3.5) 


where w > 0 and k > —1. Then œw = | corresponds to the Poisson, and w > 1 to over- 
dispersion. It can be seen that the particular value k = 0 means the Poisson variance p; 
is inflated by the factor w. This type of ‘linear’ adjustment has been proposed by 
McCullagh and Nelder (1989) as a correction procedure for over-dispersion. 

At its simplest, a correction for linear over-dispersion involves dividing the fitted 
deviance or chi-square for a Poisson regression by the degrees of freedom and deriving a 


' For example, in the Poisson the deviance for case i is 
2{yi log W;/u) — Oi — Ba) 
while in the binomial, with y; ~ Bin(p;, T;) and u; = p;T;j, it is 
2{yilog Q:/u) (T; - yi) log (Ti; — vil/[T: — wi) 


? Tt may be noted that in BUGS introducing synthetic data Z to represent densities not included does not 
necessarily require a complete likelihood. For instance, a quasi likelihood, with variance as in (3.3) and 
covariate W[i], could be coded as 

for (i in 1:N) {Z[i] <- 1; Z[i] ~ dbern(q[i]) 

qli] <- exp(Q[i]) 

Qfi] <- —0.5*log(6.28*phi[i]*pow(muf[i], kappa) ) — 0.5*D[iJ/phifi] 

log(phi[i]) <- eta[1]  eta[2]* W[i]} 
with D[i] being the relevant deviance. 
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scaling factor (i.e. estimating œw when K = 0) to adjust the standard errors of the B, for 
over-dispersion. The scaling factor w can also be obtained by a pseudo-likelihood model 
assuming Normal errors, with means p;, and variances wy,;. Another variance function 
with k = 1 occurs in the event of a specific type of gamma-Poisson mixture (as con- 
sidered in Chapter 2), when y; is marginally a negative binomial (see McCullagh and 
Nelder, 1989, Chapter 11). In fact, both the linear and quadratic variance forms (with 
k = 0 and k = 1, respectively) can be derived from mixing a gamma with the Poisson, 
but differ in the way the gamma is parameterised. 


3.2 CHOICE BETWEEN REGRESSION MODELS AND SETS OF 
PREDICTORS IN REGRESSION 


Many methods have been discussed for comparing the structural form and specification 
of regression models, including Bayesian methods, likelihood based methods, penalised 
likelihoods (Akaike, 1978), supermodel methods (Atkinson, 1969), bootstrap methods 
and cross-validation techniques (Browne, 2000). Very often, model selection schemes 
combine elements of different basic approaches. 

Model uncertainty in regression analysis may involve different aspects of model 
specification: 


(a) the error structure of the residuals (e.g. Normal vs. Student 1); 

(b) whether or not transformations are applied to predictors and outcome, which may 
be appropriate, for example, if metric data are skewed before transformation or to 
induce approximate Normality in a count outcome; 

(c) which link to adopt for discrete outcomes in general linear modelling, since the 
standard links are not necessarily always appropriate; 

(d) whether a general additive or non-linear regression term be used or just a simple 
linear expression; 

(e) which is the best subset of predictors in the regression term, regardless of its form; 

(f) for discrete mixture regressions, there is an issue of optimal choice of the number of 
components in the mixture. 


Many of these questions can be addressed by the model choice techniques considered in 
Chapter 2, for example using approximations to the marginal likelihood, using joint 
model-parameter space methods, using penalised measures of fit such as the BIC or DIC 
or using predictive (cross-validatory) criteria. 

In addition to the marginal likelihood approximations discussed in Chapter 2 may be 
mentioned the Laplace approximation. This is useful for standard regressions, such as 
Poisson, logistic or Normal linear models, since it requires differentiation of the log- 
likelihood with regard to all model parameters, and so may be difficult for complex 
nonlinear or random effect models. The Laplace approximation is based on the Taylor 
series expansion for a q-dimensional function g(x), x2, . . Xq), such that 


[exp [g(x] = (2m)1/2|S]"5 exp [e(3)] 3.6) 


where £ is the value of x at which g is maximised, and S is minus the inverse Hessian of 
g(x) at the point £. The Hessian is the matrix of double differentials 0°g¢/0x;0x;. For a 
model with parameters b, the marginal likelihood has the form 
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m(y) = [oomoo 


= | exp[logf(ylb) log m(y]db 


and in this case, the Laplace approximation may be obtained at the posterior mode Q 
(though it may also be defined at the maximum likelihood point b). For a univariate 
regression, with b = (o’, B), and q=p 2, where p is the number of regressors (ex- 
cluding the constant), the approximation is therefore 


m(y) & (T) PISPA bml) 


In this approximation, m() is the probability of the values in evaluated in terms of 
the assumed prior densities 7(), and S is minus the inverse Hessian of 


(b) = logf (yl) logni) 


evaluated at the posterior mean ẹ of the conditional variance and regression coeffi- 
cients. Then the log of the marginal likelihood is approximated as 


log[m(y)] = 0.5qlog (2m) 0.5log|S| logf(y|b) log mo) 


In practice (see Raftery, 1996), S may be estimated by minus the inverse Hessian of the 
log-likelihood at the mode, L(|y) = log f(y|¢), rather than that of g(). An approxi- 
mation might also be based on the variance-covariance matrix of pọ based on a long 
MCMC run, though this may be less valid for if regression parameters have non- 
Normal posterior densities (Diciccio et al., 1997); other options include multivariate 
interpolation (Chib, 2000). 


3.2.1 Predictor selection 


While approaches based on the marginal likelihood have been successfully applied 
to regression model selection, the structure of regression models is such that distinctive 
methods have been proposed for some of the particular model choice questions occur- 
ring in regression. One of the major motivations for seeking optimal or parsimonious 
models occurs if there is near collinearity between predictors, X1, X2,..., Xp. Taken asa 
single predictor, the coefficient 8; for predictor X; may have a clearly defined posterior 
density in line with subject matter knowledge (i.e. a 95% credible interval confined to 
positive values, assuming X; was expected to have a positive effect on y). However, with 
several predictors operating together coefficients on particular X; may be reduced to 
‘insignificance’ (in terms of a 95% interval neither clearly positive nor negative), or even 
taking signs opposite to expectation. 

Choice among p predictors to reduce such parameter instability can be seen as a 
question of including or excluding each predictor, giving rise to choice schemes specify- 
ing priors on binary indicators y; relating to the probabilities of inclusion Pr(y; = 1) or 
exclusion Pr(y; = 0) = 1 — Pr(y; = 1) of the jth predictor. Note that exclusion may be 
defined as ‘exclusion for all practical purposes’. 

Suppose a Poisson density with mean p; is assumed for count outcome y;, so that if all 
predictors are included in a log link 


log (m;)= Bo BiXn BoX2 --- BpXp 
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Then in a predictor selection model, we introduce between 1 and p binary variables y; 
relating to predictors about which inclusion is uncertain (the intercept is presumed 
always necessary). If uncertainty is applied to all predictors, the model becomes 


log (m;) = Bo YBiXa YB2X2 ---  YpBpXp (3.7) 


George and McCullough (1993) propose a stochastic Search Variable Selection Scheme 
(SVSS), whereby B, has a vague prior centred at zero (or some other value) when y= l, 
but when y, = 0 is selected the prior is centred at zero with high precision 0 e. B; is zero 
for all practical purposes). When Ta = 0 one might make? the variance T small, but 
multiply this by a large constant G when y; = 1. Typically, Bernoulli priors with 
probability 0.5 are assigned to the probability of each selection index y, being 1. An 
MCMC analysis would then be carried out and the frequency of different combinations 
of retained predictors enumerated; there are 2? possible combinations in Equation (3.7). 

Another scheme which takes the priors for binary indicators y, and coefficients B, as 
independent is presented by Kuo and Mallick (1998). Thus in Equation (3.7) one option 
is that the B, are assigned conventional priors, such as 


B; ~ NO, Vj) 


with V; large, rather than mixture priors as in the SVSS approach. This independent 
priors approach allows the possibility of more particular model elements than 
just whether predictors should be included or not. Thus for a log-linear regression of 
counts yj on factors A;, B; and their interaction Cj, one would not generally include Cj 
unless both A; and B; were included. So coding the y; to reflect this, with yy ~ Poi(yj), 
leads to 


log(uy) =a max(y;, Y3) Bi[4i] max (y2, Y3) B2[B;]  y3B3[Ci] 


so that y3 = 1 corresponds to including A;, B; and Cj. Kuo and Mallick (1998, p. 72) 
also ee a dependent priors scheme, whereby B; ~ N(O, V;) when y; = 1, but 
B; ~ NO, k; V;) when y, = 0, such that k; V; is close to zero. 


3.2.2 Cross-validation regression model assessment 


In general, cross-validation methods involve predictions of a subset y, of cases when 
only the complement of y,, denoted yp (i.e. the remaining observations) is used to 
update the prior on parameters þ. Here ¢ for a linear regression might consist of 
regression parameters and a variance term. Suppose an n x p matrix X of predictors is 
also partitioned into X, and X;. A discussion of cross-validation principles in Bayes 
regression model selection is provided by Geisser and Eddy (1979). These authors 
suggested a marginal likelihood approximation based on leaving one observation out 
at a time. The links (in terms of asymptotic equivalence) between this single-case 
omission strategy and penalised fit measures such as the AIC — where such measures 
are based on fitting to the complete sample — are discussed in earlier work by Stone 
(1974, 1977). 

Consider a procedure where cases are omitted singly, and let yj) denote the data 
excluding a single case r. Then the cross-validatory predictions have the form 


3 George and McCulloch suggest cj between 10 and 100 and q; = 8;/(2 log Gf, where ò; is the largest value 


for |B,| that would be considered unimportant. One might take 3; = Ay/AX;, where Ay is say under 0.5 of 
s.d.( y) but AX; is the range of Xj. 
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with p(y,| Ya) often known as the Conditional Predictive Ordinate (CPO). A proxy for 
the marginal likelihood m( y) is defined by the product of these terms 


my) = | [rovo 
r=1 


The ratio of mı( y) and m( y) for two models Mı and Mh) is then a surrogate for the 
Bayes factor Bj. Cross-validation methods also play a major role in regression model 
checking in identifying influential cases and other model discrepancies (see Example 
3.15). Gelfand et al. (1992) propose several checking functions involving comparison of 
the actual observations with predictions from p(y, |y). 

Rust and Schmittlein (1985) and Fornell and Rust (1989) propose a cross-validatory 
scheme for regression analysis including Bayesian model selection criteria, but drawing 
on split-half and other sample splitting methods also used in frequentist inference. Rust 
and Schmittlein emphasize that the choice of validation function may be based on 
subject matter choices (e.g. predicted vs. actual marketing share in the sales applications 
they consider). They consider cross-validation to compare models j = 1,..J via random 
splitting of datasets. Let Dı and D> be such a split, with Dı being the pre-sample on 
which parameter estimation and D; is the post or validation sample, used to assess the 
model. Let $; denote the parameter estimate under model j and ¢,; the estimate of $; 
using just the data in Dı. Then Rust and Schmittlein propose that the posterior 
probabilities of each model be approximated as 


P(M,|D) = P(D2|M;) (M,)/ PCD») 
= n(M,) | n()) fDa|e,)a,/ 


J 
>» TMi) | a(x) roses] (3.8) 


k=1 


J 
~ f(D) ni} So a(Mx) fosd) 


k=1 


Rust and Schmittlein take bi to be the maximum likelihood estimator, but one might also 
take the posterior average of the parameters under model j, ġ;. Another option, given 
equal prior model probabilities, would involve comparing the average log-likelihoods of 
the validation data (evaluated over a long run of T iterations) under model j 


T 
=A f(Dr\o%?) 
t=1 


Fornell and Rust consider the measure (3.8) for a single random split of the original 
data into sets {D,, D2}, but one might consider a large number r = 1,.. R of such splits 
and carry out such comparisons by averaging over the sets {Dj,, Da}. Thus, Alqallaf 
and Gustafson (2001) consider cross-validatory checks based on repeated two 
fold data splits into training and validation samples. Another option is k fold validation, 
where the data is split into a small number of groups (e.g. k = 5) of roughly equal size, 
and cross-validation is applied to each of the k partitions obtained by leaving each 
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group out at a time. Kuo and Peng (1999) use such an approach to obtain the predictive 
likelihood for the gth omitted group (the validation group), and suggest a product of 
these likelihoods over the k partitions as a marginal likelihood approximation. 


Example 3.1 Nodal involvement To illustrate model choice under marginal likelihood 
and cross-validation approaches, consider binary outcomes on prostatic cancer nodal 
involvement and four possible predictors (Collett, 1992). These data illustrate a frequent 
problem in model fitting: simple unpenalised fit measures such as the deviance may 
show a slight gain in fit as extra predictors (or other parameter sets) are added but 
penalised fit measures, or marginal likelihood type measures, show that any improve- 
ment is offset by extra complexity. Cross-validation approaches may also favour the less 
complex model in such situations. 


In the nodal involvement example, the predictors are xı = log(serum acid phos- 
phate), x. = result of X-ray (1 = ve, 0 = —ve), x3 = size of tumour (1 = large, 
0 = small) and x4 = pathological grade of tumour (1 = more serious, 0 = less serious). 
A probit regression model including all predictors is then 


yi ~ Bern(m;) 
probit(7;) = Bo Bixu -.-- B4X4i 


The model may be fitted directly or by introducing latent continuous and Normally 
distributed responses as in Albert and Chib (1993). Using the latter method, Chib 
(1995) shows that a model with x;—x3 only included has a worse log likelihood than a 
model including all four predictors (—24.43 as against —23.77) but a better marginal 
likelihood (—34.55 vs. —36.23). Letting M> denote the full model (four predictors) and 
Mı the reduced model, then the approximate Bayes factor Bı2 obtained by Chib is 
exp (1.68) = 5.37. By conventional criteria on interpreting Bayes factors (Kass and 
Raftery, 1996), this counts as “positive evidence’ for the reduced model, though far 
from conclusive. 

Here alternative approaches to marginal likelihood approximation and model choice, 
as discussed above and in Chapter 2, are considered. As noted in Chapter 2 the marginal 
likelihood identity 


m(y|Mj) = p(y|B, Mj)7(B|M;)/7(Bly, Mj) 


applies at any value of B, and in particular at points such as the posterior mean B. So 
(omitting M;), one may estimate log [m(y)] as 


log {m[y]} = log{p(y|B) log{m(B)} — log{m(Bly)} (3.9) 


4 Thus, suppose X; denotes a set of predictors, with regression parameter B. If y; is 1 then the latent response z; 
is constrained to be positive and sampled from a Normal with mean BX; and variance 1. This is equivalent to 
assuming a probit link for the probability that y; = 1. For the observed outcome y; = 0, z; is sampled in the 
same way, but constrained to be negative (with 0 as a ceiling value). The latent variable approach is especially 
advantageous for the probit link, and has benefits (e.g. for residual analysis). In BUGS, the following code (for 
observations y[] and one covariate x[]) might be used 
for (i in 1:n) {z[i] ~ dnorm(muf{i],1) Idow[yfi] 1),high[yfi] 1); 

mufi] <- b[1] b[2]*x[i]} 
# sampling bounds 

low[1] <- — 20; low[2] <- 0; high[1] <- 0; high[2] <- 20; 
A logit link is approximated by sampling z from a Student ¢ with 8 degrees of freedom; this is equivalent to 
Normal sampling, as in the above code, but with the precision of 1 replaced by case-specific precisions sampled 
from a Gamma density with shape and index both equal to 4. 
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The main issue is then how m(B|y) is approximated. If augmented data z is part of the 
model definition, and there is just one other parameter block (e.g. as here where there 
are latent Normal responses underlying an observed binary response, and the only other 
parameters are regression coefficients), then one may estimate n(B|y) using draws 2 
from a sampling run where all parameters are updated, and the probability of the fixed 
value B (obtained from an earlier run) is evaluated against the full conditional density 
for B given z®. The latter density is derived essentially by considering z as dependent 
variable in a Normal errors model (Chib, 1995, 2000). If the prior for B = (Bo, Bi,--, Ba) 
is denoted B ~ N,(co, C3 !), then the relevant full conditional is 


N,(B, C7!) 
where C= Cy X'X, with X of dimension n x p, and 
B=C Coco X'z] (3.10) 


with z being the vector of sampled z;. If Co is diagonal, then the prior reduces to a set of 
univariate Normals. 

Multivariate Normal or Student approximations to the posteriors t(B|y) may also be 
adopted to estimate the posterior ordinate in Equation (3.9), with covariance matrix 
based on correlations between the sampled BO, where k = 0, 1,.., 4 for model 2 and 
k = 0,.., 3 for model 1. Then 7(B|y) is the value of the multivariate Normal or Student 
density evaluated at its mean ß. 

Harmonic mean and importance sample estimators of the marginal likelihood (see 
Chapter 2) may also be considered. To examine stability in these estimators, the directly 
estimated probit model (without augmented data z) is fitted, and seven batches of 2000 
iterations are taken from a single run of 15000 after excluding the first 1000 (see 
Program 3.1(A), Model A). Priors are as in Chib (1995). The importance sample 
estimate is as in (2.9) 


fay) = 1T X gO ALOT”) 
= T/[X_ en) (3.11) 


where L® and n are the likelihood and prior densities evaluated at iterations 
t=1,.., T, and g® is the value of an importance function intended to approximate 
the posterior density m(B|y). This function is provided by a multivariate Normal 
approximation. 

Table 3.1 accordingly shows the sort of fluctuations that occur in the harmonic mean 
estimates (Equation (2.8) in Chapter 2) of the marginal likelihood. This involves 
monitoring the total log likelihood (Z in Program A, Model 3.1A), using the coda 
facility to extract runs of 2000 values, exponentiating L” to give H®, taking the average 
h of h® =1/H, and then taking minus the log of h. Comparison of the average 
marginal likelihood estimates (averaging over batches) gives Biz = 1.81. The import- 
ance sample estimates of the marginal likelihoods, as calculated in Equation (3.11), 
are —35.20 (Model 1), and —36.89 (Model 2), and so B12 = 5.42. These estimates of the 
log marginal likelihood are virtually identical to those obtained from the application 
of the marginal likelihood identity (3.9) in Model C of Program 3.1(A) (at posterior 
mean) using the MVN approximation to the posterior of B. At the posterior mean, the 
estimates are —35.19 and —36.89 for Models 1 and 2. 
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Table 3.1 Probit models for nodal involvement, regression coefficients 
and marginal likelihood approximations 


Model 1 with 3 Mean St. devn. 2.50% Median 97.50% 


predictors 

Bo —0.74 0.40 —1.50 —0.74 0.07 
Bi 1.42 0.67 0.17 1.40 2.79 
B2 1.30 0.48 0.40 1.29 2.22 
B3 1.08 0.43 0.25 1.08 1.92 


Model 2 with 4 predictors 


Bo —0.79 0.42 —1.64 —0.78 0.01 
By 1.63 0.70 0.30 1.63 3.07 
B2 1.26 0.49 0.32 1.26 2.21 
B3 0.96 0.45 0.10 0.96 1.84 
B4 0.55 0.45 —0.33 0.54 1.45 


Marginal likelihood estimates by iteration batch 


Harmonic Mean CPO Method 
Iterations Model 1 Model2 Modell Model 2 
1001-3000 —29.86 —28.85 —29.20 —29.85 
3001-5000 —28.41 —30.83 —29.18 —30.03 
5001-7000 —28.89 —28.47 —28.90 —29.88 
7001-9000 —28.55 —30.23 —29.07 —29.87 
9001:11000 —28.47 —28.60 —29.15 —29.50 
11000:13000 —28.33 —29.35 —28.91 —29.78 
13000-15000 —29.25 —29.58 —29.28 —29.86 


Program 3.1(B) follows Albert and Chib (1993), and takes the latent data z from the 
distribution z|y. Then B is evaluated against N,(6, C~!) for the samples 2 by substi- 
tuting in Equation (3.10). This gives a marginal likelihood estimate of —34.04 for Model 
1 and —35.28 for Model 2, a Bayes factor of 3.46; see Model B in Program 3.1(B). 
Model A in Program 3.1(B) illustrates the basic truncated Normal sampling needed to 
implement the Albert—Chib algorithm for probit regression. 

The CPO estimate (Equation (2.13) in Chapter 2) is obtained by taking minus logs of 
the posterior means of the inverse likelihoods (the quantities G[] in Program 3.1A, 
Model A), and then totalling over all cases. This estimator leads to Bi. = 2.06, and is 
more stable over batches than the harmonic mean estimate. 

A Pseudo Bayes Factor (PsBF) is also provided by the Geisser-Eddy cross-validation 
method based on training samples of n — 1 cases and prediction of the remaining case. 
Program 3.1(C) (Nodal Involvement, Cross-Validation) evaluates 53 predictive likelihoods 
of cases 1, 2, 3,.., 53 based on models evaluated on cases {2,.., 53}, {1, 3, 4,.., 53}, 
{1, 2, 4,.., 53),....{1, 2,..., 52}, respectively. Each omitted case y; provides a predict- 
ive likelihood and Bayes factor under model j (where here j= 1, 2), based on the 
components 


Hile), TABLA) 
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where ya denotes the sets of cases omitting the ith, namely 


Vi, y2,- . Yi-1, Yi ls- -Yn 


The ratios of the predictive likelihoods for case i (when it is the omitted case) under 
Models 1 and 2 may be denoted 
bi 

and their product bj is a form of Bayes factor. Since bi2 is skew, one may sample 
log (b12) in an MCMC run with validation models running in parallel, and then take 
the exponential of the posterior mean of log (b12) as an estimate of the PsBF. Addition- 
ally, if it is possible to run the n validation models in parallel and monitor log (b12) to 
find it exceeds 0 with a high probability, then one may say Model 1 is preferred to Model 
2. For numeric stability, a logit rather than probit link is used in Program 3.1C. 

From iterations 500-2000 of a run of 200 iterations (pooling over three chains), 
Pr(log (b12) > 0) converges to around 67%, while the posterior mean of log (b12) stabil- 
ises at around 1.77 (i.e. PsBFı2 = 5.87). One may also monitor the total predictive 
likelihoods involved (TL[] in Program 3.1C), and compare their posterior averages or 
medians to obtain an alternative estimate of Bj. The mean values of these likelihoods 
are —32.33 and —34.07, giving PsBFj2 = 6.4. Overall, the cross-validation methods 
provide slightly greater support for the reduced model than the approximate marginal 
likelihood methods dof, but both approaches support a more parsimonious model, 
whereas a simple deviance comparison might lead to adopting the complex model. 


Example 3.2 Hald data A data set from Hald (1952) refers to the heat evolved in 
calories per gram of cement, a metric outcome for n = 13 cases. The outcome is related 
to four predictors describing the composition of the cement ingredients. These data may 
be used to illustrate predictor selection in regression modelling. Options include 


e the SSVS scheme of George and McCulloch (1993), and related approaches; 

© true cross-validation methods based on omitting portions of the data as validation 
samples; and 

e predictive model assessment based on sampling ‘new data’ Z from a model fitted to 
all cases, and seeing how consistent the new data are with the observations. 


Here we adopt the third approach, and undertake a simultaneous analysis of the eight 
models by declaring repeat versions of the outcome, and evaluate them using a predict- 
ive criterion derived from the work of Laud and Ibrahim (1995). As discussed in 
Chapter 2, for model k and associated parameter set 6;, the predictive density is 


p(Zly, M =k) = [Zomo 


The predictive criterion of Laud and Ibrahim is then 


n 


C =X KEZ) - y}? var(Z))] 


i=1 


> A test of the cross-validation approach with the probit link provided similar differentials between Models 
1 and 2 (e.g. in terms of predictive likelihoods), but is subject to occasional numerical errors when the linear 
regression term becomes too large. 
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Better models will have smaller values of C? or its square root, C. Table 3.2 compares 
the eight (27) possible models in terms of C. 

Laud and Ibrahim employ a specialised prior, namely a guess at the fitted regression 
outcome for each case in terms of the available predictors; this is then translated into 
priors on the regression coefficients. Here non-informative N(0, 10°) priors are assumed 
on the regression coefficients themselves, and a G (1, 0.001) prior for the conditional 
variance. Despite this, similar C measures to those reported by Laud and Ibrahim (1995, 
Table 1, p. 255) are obtained. Taking into account variability in the criterion C, there is 
no overwhelming evidence against any model. 

However, on the Occam’s Razor principle that a Model 1 which is embedded 
in Model 2 (i.e. is nested within it and less heavily parameterised) and also has a 
better average C than Model 2, we can eliminate the models {x1, x2, x3, x4} and 
{x2, X3, X4}. 

Comparisons of the predictive criteria C; and C; between models i and j over repeated 
samples is carried out in Program 3.2 via the step() command; thus Compfi, j] is 1 if 
model j has a worse criterion than model i in a particular iteration. On this basis, it can be 
confirmed that there is essentially no clear advantage in fit for any model in Table 3.2. 
For example, Comp[1, 2] averages 0.51 from the second half of a run of 20 000 iterations 
over three chains, so that Model 2 has a worse fit than Model 1 in 51% of 30000 
iterations. Of the 28 = 8 x 7/2 model comparisons, the highest (comparing models 
1 and 7) is 0.70 and the lowest is 0.46. 


Example 3.3 Ship damage Both Winkelmann and Zimmerman (1991) and Dey et al. 
(1997) consider the ship damage data of McCullagh and Nelder (1989, p. 205). These 
data illustrate approaches to overdispersion in discrete outcomes, and the questions of 
model selection involved in introducing extra parameters to account for overdispersion. 
While a gamma-Poisson mixture 1s a standard option, this is not the only choice, and the 
goal is to use as parsimonious model as possible to ensure overdispersion is effectively 
allowed for. Hence model choice is better based on penalised measures of fit than 
unmodified likelihood and deviances. 


Excluding null observations, the data consists of counts y; of wave damage to 34 
cargo ships according to ship type t; (A-E), year of construction c; (1960-64, 1965-69, 
1970-74, 1975-79), and period of observation o; (1960-74, 1975-79). Most analyses 


Table 3.2 Predictive fit of models for Hald data (iterations 10 000- 
20000, three chains) 


Model (included Mean C s.d. C Median C 


predictors) 

1,2,4 11.40 2.73 11.04 
1,2,3 11.45 2.71 11.09 
1,3,4 11.79 2.82 11.41 
1,2,3,4 11.85 2.89 11.45 
1,2 11.98 2.72 11.65 
1,4 13.60 3.07 13.22 
2,3,4 14.11 3.33 13.68 


3,4 13.57 3.08 13.20 
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treat these as categorical factors, and take the months of service total s; as an offset with 
coefficient 1. Thus, the model for damage counts is 


yi ~ Poilu) 
log (m;) = log (s) aœ Bt) ye) (o) 


First, consider a standard Poisson regression, with N(0, 10°) priors on the regression 
coefficients @ = {a, B, y, 5}, and taking appropriate corner constraints on {B, y, ò}. 
Unpenalised fit measures are not ideal for model choice, but do indicate whether over- 
dispersion is effectively corrected for. They include the mean Poisson deviance D}, and a 
chi-square measure based on standardised residuals, 


Dr =X RO w) = DOIG: - BY /V ODI 


We can also calculate the deviance at b and obtain the effective parameters pe, and so 
obtain the DIC 


D3=D,(o) 2pe= Di pe 


Another penalised fit measure is provided by the deviance criterion (for Poisson data) 
under minimum predictive loss (Gelfand and Ghosh, 1998), which, like the criterion 
used in Example 3.2, involves sampling new data Z; from the model means p;. 

The baseline Poisson regression model, with nine parameters (and 25 degrees of 
freedom) is estimated using three chains” taken to 100000 iterations and with the 
summary excluding 5000 burn-in iterations. This model suggests over-dispersion, 
albeit not pronounced, with the mean value of Dı standing at 42 and that of D» at 
55. The deviances are highest for ships 19, 20, and 31 with 6, 2 and 7 damage incidents, 
respectively, two of these ships (19, 31) have relatively short exposure periods. 
The evaluation of the deviance at gives Dı()= 33.6, so pe=8.4 and 
D3 = 33.6 2(8.4) = 50.4. The deviance criterion under minimum predictive loss is 
55.4. 

The Poisson regression (see Table 3.3) shows lower damage rates for ship types B and 
C (type A is the first level in the categorical factor and has coefficient zero). Damage 
rates are higher for later construction and observation periods. 

McCullagh and Nelder (1989) suggest a variance inflator (based implicitly on a linear 
variance model with k = 0 in (3.5)) of 1.69, namely w = 1.69. To formalise this, one 
may adopt the Normal errors model of the pseudo-likelihood (3.4), as set out in Model 
B in Program 3.3. Thus the damage counts are taken as Normal with mean p; and 
variance 


OM; 
where w is assigned an &(1) prior. This formulation in fact allows for under-dispersion, 
Vi) < Bi 


as well as for overdispersion. w then has a posterior mean of 1.73, a median of 1.59 and 
95% interval 0.96 to 3.1, consistent with the analysis of McCullagh and Nelder. This 


© Let M; denote the posterior average of Z; and T; the posterior average of the Poisson deviance component 
(Z;) = Zilog Z; — Z;. Define Q; = (M; kyi)/(1__k), where k is positive; then the Gelfand-Ghosh measure 
is 23,[T—«(M)] 2k DXM) kyi}/{1 K} — (Qi). 

7 Starting values are provided by null values, values based on the posterior mean of a trial run, and values 
based on the upper 97.5% point of the trial run. 
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model enhances the coefficients for year of construction, but diminishes the size and 
‘significance’ of the coefficients for ship types B and C. The mean value of D; is slightly 
increased at 49, while the mean of D> is now 36. 

Then in Model C of Program 3.3 a general variance? function (3.5) is assumed, in 
conjunction with a Poisson-gamma mixture. Following Winkelmann and Zimmerman 
(1991), this is achieved as 


yi ~ Poi(y;) 

Bb; ~ GM Nei) 
niy; = exp (Bx;)'“*/(@ — 1) 
No; = exp (Bx) “/(w — 1) 


Three parallel chains are taken to 50000 iterations, with different starting values for k 
and œw. An E(1) prior onk 1 anda gamma prior, G (1, 0.01), on 


T=w-1 


are adopted. Note that the latter prior amounts to ‘forcing’ overdispersion. Conver- 
gence is relatively slow in t, and hence w, and posterior summaries are based on 
iterations 15000-50 000. 

This approach leads to a posterior means for the statistics D; and D2 of around 26 
and 19, respectively. Effective parameters are estimated as approximately 14, the devi- 
ance at the posterior mean being 11.9, so D3 ~ 11.9 28 = 39.9. This improves over the 
simple Poisson regression. The deviance criterion under minimum predictive loss is 
reduced to 49.8. 

There is clear evidence of overdispersion in the sense that w averages 2.94, though in 
this model the departure from the Poisson assumption is explicitly modelled. The 
posterior median of the power parameter k is —0.59, though with a long tail of values 
that straddle the value of zero; this corresponds approximately to a square root 
transform in Equation (3.5). Figure 3.1 shows how the posterior for this parameter 


Figure 3.1 Smooth for power parameter, K 


8 Fitted in version 1.2 of WINBUGS. 
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(based on iterations 20 000-50 000 over three chains, with sub-sampling every tenth- 
iteration) focuses on negative values near the minimum threshold. 

The credible intervals for the regression coefficients are little affected by this reformu- 
lation as compared to the Poisson regression, despite the allowance for overdispersion 
(Table 3.3). Note that Winkelmann and Zimmerman obtain a value for k of —0.74 and 
for œw of 1.85. 

Another possible modelling strategy, considered by Dey et al. (1997) and others, 
introduces a dependence of the variance for case i on predictors W;. These models 
may be substantively of interest, but may pose identifiability and interpretation prob- 
lems if, say, there are common predictors of both the mean and variance. Following 
Dey et al., the variance is a function of the log of the months of observation (Model D 
in Program 3.3). This variable is also in the model for the mean, and usually taken to 
have coefficient of unity (this means it becomes an ‘offset’, in line with a Poisson 
process). 

Here a gamma-Poisson mixture is again assumed with the inverse variances 9, of the 
gamma mixture linked to months of service. Thus, Model D involves the steps 


yi ~ Poi(p,) 
W; = vð; 


8; ~ G(g;, ;) 
logd; = a9 «a log(months service) 
log v; = log (months service) effects of ship type, etc. 


This is consistent with a quadratic variance function, with 
2 
VO) = wi Bi / 9; 


In fitting this model, N(0, 1) priors are specified on ap and a; for numerical stability and 
to improve convergence. 


Table 3.3 Ship damage data models 


Poisson (model A) General variance Function Variance regression 
(model C) (model D) 


Mean 2.5% 97.5% Mean 2.5% 97.5% Mean 2.5% 97.5% 


Regression parameters 


Intercept 6.424 —6.863 —6.013 -—6.367 -—6.732 —5.920 —6.43 7.16 5:12 
Type B 0.536 —0.876 —0.167 —0.574 —0.953 —0.091 0.53 1.14 0.10 
Type C 0.714 1.398 —0.080 —0.751 1.601 0.042 —0.65 —1.60 0.31 
Type D —0.098 —0.683 0.470 —0.170 —1.005 0.551 —0.16 —1.11 0.79 
Type E 0.320 —0.143 0.807 0.252 —0.335 0.844 0.39 —0.43 1.28 
Built 1965-69 0.699 0.416 0.991 0.676 0.298 1.047 0.69 0.24 1.16 
Built 1970-74 0.820 0.499 1.147 0.787 0.380 1.222 0.86 0.38 1.43 
Built 1975-79 0.445 —0.024 0.900 0.430 —0.142 0.966 0.49 —0.19 1.20 


Period of obs’n 1975-79 0.388 0.150 0.624 0.387 0.119 0.683 0.39 0.01 0.76 
Other parameters 


K —0.52 —0.99 0.57 
w 2.94 1.09 9.16 
a 1.25 0.27 2.35 


al 0.88 0.07 1.71 
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The last 15 000 of a 20 000 iteration run with three chains show that there is a positive 
(negative) effect a; of months service on the precision (variance). The most marked over- 
dispersion (as expressed also in the deviances at observation level in the simple Poisson 
regression above) occurs when damage incidents are relatively high in relation to exposure 
periods. Posterior means on coefficients in the model for the means are similar to in the 
baseline Poisson model, though with wider credible intervals. The average deviance D; and 
chi-squared sum D, are reduced to 25.5 and 18, and all damage counts are well predicted. 

So in this example different model approaches with similar fit (in simple deviance 
terms) have effectively accounted for, or modelled, the over-dispersion but have differ- 
ent implications for the form of the variance function and the regression effects. A 
suggested exercise is to assess, (a) the effective parameters for the DIC, and (b) the 
deviance based criteria under minimum predictive loss, in Models C and D, and so 
attempt to discriminate between them. It should also be noted that the Poisson regres- 
sion analysis may be improved by allowing an interaction between ship type and year of 
construction (Lawless, 1987). 


Example 3.4 Sales territory While many goodness of fit criteria are ‘internal’ ap- 
proaches comparing predictions with actual data, cross-validation methods involve out- 
of-sample predictions as a way of model assessment. Rust and Schmittlein (1985) 
consider the application of the split sample Bayesian Cross Validation (BCV) method 
to 40 observations relating to marketing territory data for sales of click ball point pens 
(y, in $000s), advertising spots (X1), number of sales representatives (X2), and a 
wholesale efficiency index (X3). Two fold cross-validation is one option; other options 
are n-fold cross-validation, or k-fold cross-validation (k small but exceeding two). 


Two alternative sales prediction models are to be assessed via the split sample or two 
fold cross validation, namely linear and linear-in-logs forms: 


Y= Qo aX] ar X> a3X3 el 
Y= By, log(%1) Bz log(%2) B;log(43) e 


As Rust and Schmittlein note, these are non-nested models, the relative performance of 
which may be difficult to assess with classical methods. Note that we take sales in 
thousands of dollars divided by 100 provide a scaled outcome Y. This means likelihoods 
will differ from those cited by Rust and Schmittlein. 

Assuming Normal errors e, we compare penalised likelihood criteria for model choice 
involving a model fitted to all observations with a BCV check based on a single random 
split of the 40 observations into an estimation sample of 20 and a validation sample of 
20. This split is provided by Rust and Schmittlein, though Program 3.4 includes code to 
derive alternative random split halves of 40 observations. 

The BCV approach yields results as in Table 3.4, based on a fit to the 20 training 
cases and a log-likelihood comparing actual and predicted values of the outcome 
for the remaining 20 validation cases. The latter suggests a small advantage to the 
linear model: a posterior model probability of 2.46/(1 2.46)=0.71, where 
2.46 = exp(— 21.7 22.6). A suggested exercise is to: 


(a) estimate and validate in the ‘reverse’ direction, noting that there is no reason in 
particular to regard one sub-sample as the training sample and one the validation 
sample; and 
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(b) split the data randomly into four sub-samples and estimate the two models four 
times over cases in three sub-samples, validating each time over cases in the 
remaining sub-sample, and accumulating the predictive log-likelihoods. 


To derive standard penalised fit measures, the log-likelihoods at the posterior 
means of the parameters (i.e. the regression parameters and error variance) are calcu- 
lated under the two models and penalised according to p = 5 parameters under the AIC 
or 5p/2 x log (40) = 9.22 under the BIC. The respective log-likelihoods, following 
runs of 10000 iterations under the linear and log-linear models, are —22.2 and —25.7, 
so the linear model is also preferred on the basis of a full data set analysis. Since both 
models have the same number of parameters, the use of AIC or BIC makes no difference 
to this conclusion. Hence, there is unanimity between model choice methods in this 
example. It is apparent from Table 3.4 that the inclusion of X3 is doubtful under either 
model, and the approach illustrated in Example 3.2 might be used to compare reduced 
models. 


Example 3.5 Laplace approximation to marginal likelihood: counts of schizophre- 
nia Example 3.3 considered simple and penalised deviance measures of fit for a 
Poisson count outcome, analogous to classical approaches. Here we consider the 
Laplace approximation as applied to model choice for Poisson regression, approximat- 
ing formal Bayes model choice based on marginal likelihood. In this approximation, the 
parameter dispersion matrix S in Equation (3.6) is estimated using the second order 
differentials of the log-likelihood of the data. 


Consider counts y; of inpatient hospitalisations of schizophrenic patients for 44 small 
areas (electoral wards) in two London boroughs with expected cases denoted E; (see 
Congdon et al., 1998). These are to be ‘explained’ in terms of measures of social 
deprivation and community structure. Thus, Model 1 relates psychiatric morbidity to 
a single predictor, the Townsend deprivation index (X1), and Model 2 combines the 
Townsend and anomie indices (X; and X2, respectively). Univariate Normal priors are 
adopted for the regression coefficients, namely b; ~ N(0, 100). 

Then y; ~ Poi(;), where the Poisson means p; are related to predictors via a log-link: 


log(u,) = log (E) bo biXu 2X2; 


Table 3.4 Click ball points: BCV analysis 


Linear Mean St. devn. 2.5% Median 97.5% 


log-likelihood —21.7 3.0 —28.7 —21.4 —16.8 

Ao 0.18 0.98 —1.81 0.19 2.10 
Ai 0.16 0.09 —0.02 0.16 0.35 
Ao 0.35 0.26 —0.16 0.35 0.87 
A3 0.10 0.23 —0.36 0.10 0.56 
Linear-in-logs 

log-likelihood —22.6 3.1 —29.8 —22.2 —17.6 

Bo —2.93 1.36 —5.60 —2.93 —0.26 
Bı 1.87 0.97 —0.01 1.86 3.75 
B2 1.63 1.27 —0.86 1.66 4.13 


B3 0.06 0.55 —1.04 0.06 1.16 
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Omitting constants, the log likelihood for area i is 
L(bly) = —p; yilog(p;) 
and the second-order differentials of L with respect to b; and bx have the form 
— Xi Xie bi 


We fit Model 1 and obtain posterior means and standard deviations bọ = —3.1 (0.35), 
bı = 1.27 (0.14). The total Poisson log-likelihood (F[1] in program 3.5, Model B) is 
—137.3, while the marginal likelihood is approximated via the Laplace method” is 
—147.2. 

Adding the predictor anomie produces a gain in likelihood which would be judged 
significant by usual methods, with L now at —133.8. The posterior means and standard 
deviations are 


by = —3.24(0.33), bı = 0.81(0.21) and by = 0.55(0.21) 


The size of the Townsend score coefficient is much reduced and its precision lessened, 
and the marginal likelihood suggests in fact no gain in fit, standing at — 147.5. So there is 
in effect nothing to choose between the models in Bayesian terms, with the (approxi- 
mate) Bayes factor close to unity. 

Support for a simple model including X; only is provided by applying the Kuo- 
Mallick dependent priors predictor selection method (Kuo and Mallick, 1998, p.72). 
Here the model is extended to include a third index, the mental illness needs index or 
mini[] in Program 3.5. Priors on the three regression coefficients are set so that a value 
y; = 0 on the binary index for the jth coefficient results in an effectively zero coefficient. 
With three chains, convergence on the coefficients is apparent from around iteration 
9000. The second half of a run of 40 000 iterations shows y; to have value 1 (X; included 
in every iteration), whereas y, is 0.09 and y, is only 0.006. 


3.3 POLYTOMOUS AND ORDINAL REGRESSION 


Many outcomes relating to political or religious affiliation, labour force or social status, 
or choice (e.g. of travel mode to work) involve ordered or unordered polytomous 
variables (Amemiya, 1981). The underlying model for such outcomes typically involves 
a latent continuous variable, which may be conceptualised as an attitudinal, prestige 


? The BUGS coding for the Laplace approximation involves the inverse( ) and logdet( ) commands. For a 
Poisson outcome y[], covariates X[,1:p], posterior mean parameter estimate b[1:p] from an earlier fit involving 
N(0,100) priors, and expected counts Ef], the coding for a non-iterative program could be: 
for (i in 1:N) { # Poisson mean 
log(mufi]) <- log(E[i]) | sum(C[i,1:p]) 
# log-likelihood 
Lii] <- —muf[i] —_ y[i]*log(mufi]) — logfact( y[i]) 
for (j in 1:p) { CHj] <- bIJI* X.J] 
for (k in 1:p) { Dfij,k] <- -X[i.j]*X[i,k]*muļi]}}} 
# inverse Hessian 
for (j in 1:p) { for (k in 1:p) { H[j,k] <- — sum(D[j,k]) 
S[j,k] <- inverse(H[,],j,k)}} 
# log probs of posterior mean values under Normal prior densities 
for (j in 1:p) { tau.b[j] <- 0.01; mu.b[j] <- 0; 
Pr[j] <- 0.5*log(tau.b[j]/6.28) — 0.5*tau.b[j]*pow(b[j] — mu.b[j],2)} 
# Laplace approxn 
F <- 0.5*p*log(6.28) 0.5*logdet(SL]) sum(L[) | sum(Pr[]) 
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or utility scale, that determines the observed rating or category of belief, status or 
choice. 

The multinomial logit and multinomial probit models generalise their binomial 
equivalents and are suitable — perhaps with modifications — to modelling multicategory 
outcomes which do not involve ordered categories. So just as the logit model for 
a binary outcome involves the log odds of a positive to a negative response, so a 
multinomial logit involves stipulating a baseline category (say the first of K possible 
outcomes) and comparing the probabilities of outcomes 2, 3,.., K against that of 
the first. 


3.3.1 Multinomial logistic choice models 


Covariates may be defined for individuals i, according to different choices j, or in terms 
of particular features of choice j which are unique to individual 7. Thus, in a travel mode 
choice example, the first type of variable might be individual income, the second might 
be the generic cost of alternative modes, and the third might be individual costs attached 
to different modes. Consider a vector of covariates X; specific to individuals i alone, and 
let y; = 1 if option j is chosen. Then for K possible categories in the outcome, we may 
specify a multiple logit model 


k=2 


K 
Pr( yy = 1) = py = exp (B; s/f X exp (B; wx) jal 
(3.12) 


K 
Pr(ya = 1) = pa = uhi X exp (Bx ww} 


k=2 
or equivalently, 
log {p/P} = Bj YjXi 
Also, for j and k both exceeding 1, 
log {pi/Pix} = (Bj — Be) O- VK) Xi 


so that choice probabilities are governed by differences in coefficient values between 
alternatives. Strictly, the above formulation (involving covariates constant across alter- 
natives j) is known as a ‘multinomial logit’, and focuses on the individual as a unit of 
analysis. 

If instead we consider attributes W; of the jth alternative specific for individual į, then 
a conditional logit model is obtained with 


K 
pi = exp (8W;)/ X exp (8 Wix) (3.13) 
k=1 


Dividing through by exp (ò W;;) gives 


K 
Py =1/ > exp [Wx — Wi) 
k=l 
In the conditional logit model, the coefficients 6 are usually constant across alternatives, 
and so choice probabilities are determined by differences in the values of characteristics 
between alternatives. A mixed model, combining features of both Equations (3.12) and 
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(3.13) would include both individual level attributes X; and alternative specific charac- 
teristics Wi. Thus 


log (pi /Pix) = (Bj) -— Ba) OY -—¥W)Xi Wy- Wik) (3.14) 


Multiple logit models can be expressed in terms of a model for individual choice 
behaviour. Thus, let z; be the unobserved value or utility of choice j to individual 7, with 


Zj = U(X, Sj, Wy, €y) 


where S; are known predictors for choice j (e.g. climate in state j for potential migrants 
to that state), and X; and W; are as above. The g; are random utility terms. Assuming 
additivity and separability of stochastic and deterministic components leads to 


Zij = Vii Bi (3.15) 
with a regression function such as 
vj=B; yX W; $ 
Then the choice of option j means 
Zj > Zk KAS 
and so 
Pi = Pr(zy > Zik) 
Equivalently 
ya = l if zj = max (Zi, Zi2,.-.. Zik) 
Assume the gj follow a type I extreme value (double exponential) distribution with cdf: 
F(e,) = | [,exp(— exp (— &y)) 
and if the assumption in (3.15) holds also, 
Pr( yy = 1|X;, Wy, Sj) = exp y)/ X, exp (Vi) 


with Bı = yı = 0 as in Equation (3.9) for identifiability. 


3.3.2 Nested logit specification 


A feature of the conditional model (3.10) is that the relative probability of choosing 
the jth alternative as compared to the kth is independent of the presence or absence of 
other alternatives (the so-called independence of irrelevant alternatives or IIA axiom). 
In practice, the presence of other alternatives may be far from irrelevant, and there may 
be similar alternatives (e.g. with similar utilities to the consumer) between which 
substitution may be made — see Congdon (2000) for an application involving patient 
flows to hospitals. We may consider a simple form of nested logit model adapted to 
account for real world departures from IIA (Poirier, 1996; Fischer and Aufhauser, 
1988). Thus, suppose there is one nest of K — 1 alternatives within which the simple 
conditional form (3.10) holds, and a single alternative m from the original set of K which 
is isolated from the nest. Let K[ — m] denote the set of alternatives excluding the mth. 
Then for choice by individuals and predictors Wj specific to i and j 
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Vi, ık ~ Mult(p;, ı:x, 1) 
For the isolated alternative 
Pim = exp (Wim)/{ exp (Wim) exp (al;) 


where J; is known as the inclusive value 


L= oe x osn(na/a} 


JjeK[—m] 
Within the nest defined by K[ — m], the second level choice probabilities are given by 
Pij = (l — Pim) Qj 


where 


qj = exp (Wyò/a)/ X. exp (Wxò/a) 
keK[=m] 


is the standard conditional logit confined to alternatives in the set K[— m]. The 
coefficient a has null value 1, in which case the model reduces to the standard condi- 
tional logit over all K alternatives and with unrestricted IIA applying. If a is between 
0 and 1 then substitution is greater within a nest than between nests. If substitution 
among nests exceeds substitution within nests, then a exceeds 1. In the example below 
we assume a is positive, though negative values of a are not invalid numerically. Poirier 
(1996, p. 172) suggests the reparameterisation \ = 8/a, and so ò = ad, with priors taken 
on \ anda. 


3.3.3 Ordinal outcomes 


The multinomial and conditional logit models make no assumptions about the 
ordering of a categorical outcome. However, ordinal response data are frequently 
encountered in the social and health sciences. In opinion surveys, respondents are 
often asked to grade their views on a statement on scales from ‘strongly agree’ to 
‘strongly disagree’. Health status or diagnostic gradings of disease are often measured 
on multicategory scales as ‘normal’ to ‘definitely abnormal’ (diagnosis) or as ‘good’ to 
‘poor’ (health). 

Among statistical issues raised by such data are the modelling of ordinal regression 
relationships (e.g. with predictors as well as outcome ordinal); whether an underlying 
latent scale assumption need be invoked (Armstrong and Sloan, 1989); and whether or 
not the ordering of responses is relevant to stratifying regression relationships, for 
example with different slopes according to each ordinal response category (Anderson, 
1984). 

The cumulative odds model is often expressed in terms of an underlying continuous 
response, though this is not strictly necessary. Thus McCullagh (1980) outlined the 
regression model for an observed ordinal response variable T; (with possible values 
1, 2,.. K) taken to reflect an underlying continuous random variable. For i= 1,..N 
respondents this model has a systematic component in the form of a cumulative 
probability 


Vy = Pr(T; < j) = FG; — p;) j=1,..K-1 
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or 
F! (yy) = 6; — p; 


Here w; = BX; is a regression term defined by covariates X;, 0; represents the cut point 
corresponding to the jth rank, and F is a distribution function. If the X are all 
categorical then i typically indexes subject groups defined by combinations of covari- 
ates. Special cases of F include the logistic (leading to a proportional odds model) and 
the extreme value distribution leading to a proportional hazards form. 

So if pj is the probability of lying in the jth ranked group, then y; = pa ~.. Pý- 
Conversely, the probabilities of an observation lying in the jth rank are given by 
differencing the cumulative probabilities: 


Pi >= Yi 
Pi = Vi — Yi, j-1 
Pix =1- Vi, K-1 
If F is a logistic, with 
Cj = logit(y;) = 0; — BX; (3.16) 


and B is uniform across response categories j, then the 9; are the logits of belonging to 
categories up to and including 1,..j (as against categories j 1,..K) for subjects with 
X = 0. The difference in cumulative logits for different values of X, for example X, and 
X>, is independent of j. This is known as the ‘proportional odds’ property!”, with 


Cy — Cy = BX — X1) 


3.3.4 Link functions 


Several authors have considered the issue of more general link functions for binary or 
ordinal outcomes than the logistic or probit links. Lang (1999) proposes a mixture of the 
symmetric logistic form for F and two asymmetric forms. The latter are the Left Skewed 
Extreme Value (LSEV) distribution 


F\(t) = 1 — exp(— exp ()) (3.17) 
and the Right Skewed Extreme Value (RSEV) distribution 
F3(t) = exp (— exp(— 0) 


10 In BUGS the essential coding for the proportional odds model applied to individual choice (and for one 
covariate X{/]) has the form, together with specimen priors: 
model{ for (i in 1:N) { for (j in 1:K-1) { # logit of cumulative probability of rank j or lower 
logit(gammali,j]) <- theta[j] — muli] } 
# probability of jth rank 
pli, 1] <- gamma [i, 1]; 
for (j in 2:K-1) { plij] <- gamma [i,j] - gamma [1,j-1] } 
pli,K] <- 1 — gamma [i,K — 1]; 
mu[i] <- b[1] — x{i]*b[2] 
yli] ~ deat(pfi,1:K])} 
b[1] ~ dnorm(0,0.001); b[2] ~ dnorm(0,0.001) 
theta[1] ~ dnorm(0,1) I(0,theta[2]) 
for (j in 2:K-1) {theta[j] ~ dnorm(0,1) I(theta[j-1],)}} 
Note that sampling of y[] could also be represented, with input response data appropriately changed, as 
yli, 1:K] ~ dmulti(pfi,1:K],1). 
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with F(t) being the logistic distribution 
F(A = exp(t)/(1_ exp (Ð) 


The distribution in (3.17) corresponds to the complementary log- log form of link for 
probability p: 


F~! (p) = log(— log (1 — p)) 
The mixture has the form 
Fy) = wiQFi(t) wN wA) (3.18) 


The mixture proportions on the LSEV and RSEV links in the model of Lang are given 
by 


w1(A) = exp ( — exp(3.5A_—.2)) 
and 
w3(A) = exp ( — exp(— 3.5 2)) 


respectively. They depend upon an additional parameter \ which has a Normal density 
with mean zero and known variance o?. The mixture proportion on the straightforward 
(i.e. conventional) logit link, intermediate between the asymmetric forms, is obtained as 
w2(A) = 1 — wi (A) — w). 

Lang outlines how the mixture proportions on the three possible links relate to the 
value of \, and its variance. Negative values of à are obtained when the LSEV form is 
preferred, and positive values when the RSEV is preferred; N = 0 corresponds to the 
logit link. Most variation in the mixture proportions {w)(A), w2(A), w3(A)} occurs when 
à is in the interval [—3, 3], and this leads to a prior on à with variance 5 as being 
essentially non-informative with regard to the appropriate link out of the three possible. 

For an ordinal response as in Equation (3.16), we then have 


Py = FAO; — BX) -A O- — BX). 


Other Bayesian approaches for general linear models with link functions taken as 
unknown include Mallick and Gelfand (1994). 


Example 3.6 Infection after caesarian birth We first consider a multinomial logistic 
model applied to data from Fahrmeier and Tutz (1994). In this example the outcome is 
multinomial but unordered and the predictors are categorical. Specifically, the outcome 
is a three fold categorisation of infection in 251 births involving Caesarian section; 
namely no infection, type I infection, and type II infection. The risk factors are defined 
as NOPLAN = 1 (if the Caesarian is unplanned, 0 otherwise), FACTOR = 1 (if risk 
factors were present, 0 otherwise), and ANTIB=1 if antibiotics were given as prophy- 
laxis. Of the eight possible predictor combinations, seven were observed. The numbers 
of maternities under the possible combinations range from 2 (for NOPLAN=0, 
ANTIB=1, FACTOR=0) to 98 (NOPLAN=1, ANTIB=1, FACTOR=1). N(0, 10) 
priors are assumed for the impacts of different levels of these categorical variables. 


The estimates for this MNL model, following the structure of Equation (3.9), show 
antibiotic prophylaxis as decreasing the relative risk of infection type I more than type II. 
Hence y» is more highly negative than y,3. By contrast, the presence of risk factors in 
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a birth increases the risk of type II infection more: the odds ratio of type II vs type I infection 
when a risk factor is present, measured by pi3/pi is exp(— 2.54 2.14)/ 
exp(— 2.61 1.78) or 1.54 (see Table 3.5). 

In fact we can use sampling to provide a density for such structural parameters, rather 
than obtaining “point estimates’ based on the posterior means of constituent param- 
eters. This involves defining new quantities sı = exp(B; y33) and s2 = exp (B>  Y32) 
and monitoring them in the same way as basic parameters. One might assess the 
probability that sı > s2 by repeated sampling also — for example, by using the step() 
function in BUGS. 

If there were not strong evidence that sı > s2, the coefficients on NOPLAN, ANTIB 
and FACTOR might be equalised since Table 3.5 suggests homogeneity across the two 
infection types. This would involve three fewer parameters, and may provide a better fit 
after allowing for reduced parameterisation. 


Example 3.7 Travel choice As an example of multicategory choice and possible 
departures from the ITA axiom discussed in Section 3.3.2, consider the travel mode 
choice data from Powers and Xie (2000). These relate to three choices (1=train, 2=bus, 
3=car) among N = 152 respondents, with 

yi, 13 ~ Multi(p;, 1:3, 1) 


where, for example, y; = 1 if a subject chooses the train and zero otherwise. Choice is 
modelled as a function of respondent income (an X variable, as in Section 3.3.1) and of 
destination attributes which differ by respondent (W variables). The latter are terminal 


Table 3.5 Risk factors for infection after Caesarian delivery 


Mean St. devn. 2.5% Median 97.5% 


B2 —2.610 0.605 —3.963  —2.565 —1.595 
B3 —2.537 0.519 —3.627 —2.507 —1.638 
NOPLAN 

yi2 1.075 0.492 0.112 1.085 2.066 
yi3 0.927 0.448 0.013 0.929 1.828 
ANTIB 

Yn —3.410 0.650 —4.722  —3.398  —2.208 
Y23 —3.012 0.535 —4.119 —2.999 —1.991 
FACTOR 

32 1.775 0.658 0.596 1.746 3.227 
Y33 2.140 0.556 1.108 2.127 3.317 


waiting time for bus and train (denoted TIME[] in Program 3.7, and zero by definition 
for car users), in-vehicle time (IVT[]), and in-vehicle cost (IVC[]). Log transforms of the 
four predictors are taken, though the analysis of PX was in their original scale. The log 
transformed predictor model gives a better log-likelihood than the untransformed 
predictor model. 
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A model with common regression coefficients across choices as in the conditional 
logit specification (3.13) is fitted first. This provides evidence that the probability of 
choosing a particular mode decreases as in-vehicle costs and in-vehicle time increase. 
Note that the effect of terminal waiting time is wrongly signed in substantive terms. 
Thus the log-odds of bus (Model 2) against the baseline category train (mode 1) may be 
obtained (see Table 3.6) as 


log (p2/pa) = 0.31(log[(TIME, 1)/(TIME; 1) 
— 13.3 log (IVC /IVC;1) — 0.58 log (IVT2/IVT1) 


where TIME2 is bus terminal time and TIME! is train waiting time. The same coeffi- 
cients apply for the log odds outcome log (p;3/pj1). 

Because of the unexpected direction of one coefficient, one might consider whether 
prior constraints or specialised priors to tackle the collinearity in the predictors. 
Constraining the coefficients to be negative while retaining all respondents results in a 
worsening of fit, with the average log-likelihood falling from —87 to —92. The absolute 
size of the coefficient on invehicle costs is reduced. While assessment of outlier 
or influential observations, or perhaps variable selection, might seem natural steps 
as well, it is necessary to consider whether specification might be improved (e.g. 
to allow for interdependent destinations or the impact of omitted individual level 
influences). 

As an alternative specification option, nesting of choices might first be considered. 
For illustration, the nested logit model with bus and train as alternatives within a nest 
of K — 1 = 2 choices, and with the car option taken as an isolated single alternative. 
Unconstrained priors on TIME, IVC and IVT are assumed. This leads (with 
5000 iterations over three chains and 1000 burn in) to a posterior mean on a around 
0.38. The 95% credible interval is entirely below 1, suggesting substitution within the 
nest. 


Table 3.6 MNL model for travel choice (Non-nested (IIA) and 
Nested (Non-ITA)) 


Mean 2.5% 97.5% 


ITA 

Log likelihood —87.3 —90.4 —85.9 
TIME òi 0.31 0.10 0.53 
IVC ò —13.3 —16.8 —10.2 
IVT 33 —0.58 —1.21 0.06 
Non-ITA 

Log Likelihood —82.5 —86.1 —80.7 

a 0.38 0.21 0.61 
TIME òi 0.26 0.10 0.44 
IVC 82 —10.2 —13.7 —7.2 


IVT 33 —0.41 —0.86 0.00 
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The coefficients y are reduced (absolutely) as compared to the HA alternative where 
a = 1 by default. While there is a clear gain in fit as measured by log-likelihood, the 
unexpected sign on TIME remains. 

We then allow, in a mixed model approach of (3.14) and assuming IIA, for individual 
specific characteristics X;. This involves creating dummy variables for K — 1 alternatives 
with train as the base category and multiplying each individual covariate by the K — 1 
dummies. Here household income provides a single individual level covariate and this 
involves four extra predictors, with dummies defined for the bus and car options. This 
analysis is based on 5000 iterations over three chains, with 1500 burn-in, and is shown in 
Table 3.7. It can be seen that while the attractiveness of bus and car as against train, 
both increase with income, only the car vs. train choice (measured by y) shows a clear 
positive impact of income. 

One may also at this stage assess whether certain observations are influential or 
outlying, and distorting the model estimates for all observations combined. For assess- 
ing outliers, a procedure suggested by Gelfand and Dey (1994), Weiss (1994) and others, 
avoids the computing burden of single case omission by considering an estimate for the 
CPO statistic using output from an MCMC run of length 7. This is obtained as 


T 
CPO, = T! Geel 
t=1 


where f(y;|0) is the total likelihood for each respondent, totalling over destinations. 
Thus, we find in the mixed model that the CPO statistics to be especially low for respond- 
ent numbers 5, 59, 61 and 80, and one might consider an analysis excluding them. 

To assess the influence of an individual case i, one might compare regression param- 
eter estimates or samples of new data Z based on (a) the posterior p(0|y™”) using the full 
data y“), as against (b) the posterior p(6| Ya) using all the data set except case i, yj. One 
might, for example, compare a certain regression parameter B* by the discrepancy or 
checking function (Weiss, 1994) 


D, * = E(B*| iq) — EB* 


In the transport choice analysis, we might consider the two parameters y; and y, 
describing interaction between mode and income. Other types of discrepancy involve 
predictions of replicate data Z, with distributions under options (a) and (b) 


p(Z|\y) = [oz |0)p(0|y)d0 


Table 3.7 Mixed MNL model, parameter summary 


Mean 2.5% 97.5% 


Log Likelihood —58.9 —63.5 —56.2 
òi TIME —3.2 —4.8 —1.9 
82 IVC —11.0 —14.7 —7.7 
83 IVT —0.88 —1.86 0.11 
Bi BUS —3.32 —6.97 —0.09 
yı BUS x INCOME 0.75 —0.29 1.90 
B2 CAR —19.2 —27.0  —12.6 


y2 CAR x INCOME 1.80 0.69 3.06 
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p(Z ya) = [ool ao 


One might compare the predictions Z; and observations y; for the suspect case i. For 
instance, in the multinomial example here, one might compare the relative distribution 
of the Z;, 1.3 (i.e. proportions in the three travel modes) between the two sampling 
options (a) and (b). One might also compare the likelihoods of Z; under the two options, 
with a discordancy index, such as (Geisser, 1990) 


Dz, = f (Zila) log {f (Ziy) /f (Ziy 


Such comparisons may be computationally intensive, and here we consider the impacts 
of omitting the cases with the lowest and highest CPO as defined above. These are, 
respectively, case 80 and 63, with case 80 opting for car despite lower than average 
income. Using single long runs of 20000 iterations (see Model D in Program 3.7) we 
find, as in Table 3.8, that case 80 has a discordancy index clearly higher than case 63. 
There is also more difference in the relative proportions over the three modes (compar- 
ing in Program 3.7 the samples y.new[80,] and y.new.d[80,] of new data when estimation 
is based on all cases y) and on yy, respectively). There seems less influence in terms of 
the income by mode interaction parameters, though excluding case 80 tends to slightly 
raise the income by car effect. 


Example 3.8 O-ring failures The logit and probit links are the default options for 
categorical choice models. To illustrate the potential impact of alternative links, the 
model of Section 3.3.4 is applied to 23 observations relating to thermal problems on the 
US Challenger missions in 1986. Let y; = 1 if a primary O-ring showed thermal distress, 
and y; = 0 otherwise. The single covariate is the temperature S in Fahrenheit at launch. 
Following Lang (1999), we take this covariate as uncentred, and with m; = P, (vy; = 1), 
obtain a logistic model with posterior mean (and s.d.) parameter estimates 


logit (m;) = 19.2 — 0.29 S 
(9.3) (0.14) 


Following Lang, a Normal prior with mean 0 and variance 5 on À is taken in the general 
link model (3.18). Because of the relatively sparse data, a clear choice of link is unlikely, 


Table 3.8 Checking functions under case omission, cases 63 and 80 


Omitting Case 80 Omitting Case 63 
Mean St. devn. Mean St. devn. 

New Data Discrepancy D., 0.0433 0.1520 0.0000 0.0001 
Difference in Bus by Income Dy), 0.0237 0.7473 —0.0330 0.7845 
Difference in Car by Income Dy, 0.1529 0.9156 0.0211 0.8809 
Relative distribution between modes 
Full data, Z = Train 0.6561 0.4750 0.9999 0.0079 
Full data, Z = Bus 0.3330 0.4713 0.0000 0.0000 
Full data, Z = Car 0.0109 0.1037 0.0001 0.0079 
Data omitting case, Z = Train 0.6442 0.4788 0.9999 0.0079 
Data omitting case, Z = Bus 0.3525 0.4778 0.0000 0.0000 


Data omitting case, Z = Car 0.0033 0.0573 0.0001 0.0079 
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but the analysis leans towards the complementary log-log link, corresponding to the left 
skewed extreme value distribution. 

Three chains are taken to 100000 iterations (with convergence of N apparent at 
around 7000 iterations). The weights on the three components are 0.53, 0.35 and 0.11, 
and the average of à is close to —1, with By = 16.5 and B; = —0.26 (Table 3.9). So there 
is support for a non-logit form of link. Figure 3.2 shows the posterior density of \, based 
on 9000 iterations (every tenth iterate over 30 000 iterations on three chains), with most 
mass between —2 and 0. 

On this basis, we can summarise the relation as 


F\(Bo 8, S) = F_1(16.5 — 0.268) 
By contrast, Lang obtains 
Fy(Bo BiS) = F-143(14.7 — 0.23S) 


As a check to model form, a Bayes predictive check (Gelman et al., 1995) is used, based 
on comparing a chi-square measure for the observed data x?(Z;, 7;) and for replicated 
data, namely X (Zip mi). This is estimated at 0.27, and shows no evidence of lack of fit. 


Example 3.9 Ordinal ratings in diagnostic tests As an illustration of Bayesian ordinal 
regression, we consider an issue in medical diagnosis. The performance of diagnostic 
tests is measured by various probabilities: the probability of correctly assessing a 
diseased individual is known as the sensitivity (i.e. true positive rate) of the test, and that 


Table 3.9 Choice of link: O-ring data, parameter 
summary 


Mean St. devn. 2.5% 97.5% 


Bo 16.52 8.19 3.65 36.12 
Bi —0.26 0.12 —0.55  —0.06 
N —0.99 1.78 —4.69 2.72 

i- o-2 


Figure 3.2 Smooth of N 
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of correctly classifying a healthy individual is the specificity. The false positive rate 
(chance of classifying someone healthy as ill) is 1 minus the specificity. 

Assume outcomes 7; on subjects are available as a rating scale with K levels, with 
category | corresponding to the lowest indication of disease and category K to the 
highest (e.g. for K = 3 we might have 1=normal, 2=equivocal, 3=abnormal). As in 
Section 3.3.3, K — 1 cut points 6; are assumed so that an individual will fall in category 
1 if the underlying scale value for that subject z; is less than 01, in category 2 if 
0, < z; < 02, and so on, and in category K if z; > 0@x_1. 

The performance of a test depends upon the threshold chosen on the latent scale z. 
Lowering the threshold will improve sensitivity, but at the expense of more false 
positives. The relation between the true and false positive rate as the threshold varies 
is known as the Receiver Operating Characteristic (ROC) curve. A parametric approach 
to estimating this can be derived from signal detection theory, and leads to a smooth 
ROC curve. However, this approach does not adapt to include covariates (e.g. age or 
disease stage) which might affect the test accuracy. 

Assume, therefore, that p covariates are available on each subject, and that the first 
of these is a dummy index for true disease state (x; = 1 for diseased cases i), so that 
there is no intercept. Then following Tosteson and Begg (1988), we may specify the 
cumulative probability y;(x;) of response up to and including category j for a subject 
with covariates x;. 


F! {yœ} = (0; — Bxi)/ exp (x) (3.19) 


Taking F as the Normal distribution function amounts to assuming the latent variable 
zis Normal with mean Bx; and standard deviation exp (6x;). Other links are possible, 
such as the logit, but interpretation of the B parameters changes. When the first 
covariate x; is true disease status and x, is set at 1, the true positive rate or sensitivity 
for a subject with rating is 


1 — yj) 


since y(xı i) with xı; = 1 is the probability that a diseased subject is ranked in one of the 
diagnostic groups 1,..j. 

The basis for making the variance of the latent scale depend upon covariates is 
likely to be governed by the application area. Tosteson and Begg (1988) consider the 
detection of hepatic metastases by ultrasonography, and cite evidence that the spread 
of responses in radiological rating data is wider in diseased than non-diseased subjects. 
This difference would be modelled by the parameter 6;. Their analysis and a reanalysis 
by Peng and Hall (1996) concern the accuracy of ultrasonography ratings on a five 
point scale (see Appendix Table 1). These ratings are used in assessing the presence 
or otherwise of metastases in patients. Ratings are related to true metatstasis x1, 
and cancer site (either breast cancer, with x. = 1, or colon cancer, with x2 = 0). 
A third predictor is the interaction x3 = x,x2 between true metastasis status and cancer 
type. 

A ‘full model’ includes all three covariates in both B and 6 in Equation (3.19). A logit 
link is adopted for numerical stability — a probit link would best be implemented with 
the Albert—Chib (1993) sampling method. The priors on the cut points are constrained 
to be increasing, with 


0; ~ N(j, 10) 2(8;-1, 9; 1) 
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The logit link full model is not well identified (cf. Tosteson and Begg), but suggests that 
variability is greater for metastasis cases (for whom x; = 1) and breast cancer cases for 
whom x2 = 1 (see Table 3.10). These findings are in line with Tostesan and Begg (1988, 
Table 2), though they report different signs on B and B; than obtained here (though 
neither effect is significant in their results). The sensitivity of the scale runs at 0.75 or 
more across the possible ranks. 

A reduced model, involving just x, the true status, in defining B and 6, shows a clear 
association between true status and the diagnostic classification. With a run of 10000 
iterations from three chains (and 2500 burn-in), the parameter Bı averages 1.9 and has a 
95% interval consisting only of positive values. However, the effect of metastasis on 
variability in ultrasonography ratings (as measured by ıı), is no longer present. As in 
the full model, the latent scale cut points 0; suggest that upper ranks on the scale could 
be merged. 


3.4 REGRESSIONS WITH LATENT MIXTURES 


Chapter 2 considered finite mixtures to describe heterogeneity in which latent classes 
j=1,.., J differ only in their means p, and other population parameters (e.g. variances 


Table 3.10 Ultrasonography ratings parameters 


Full model Mean St. devn. 2.5% Median 97.5% 


Influences on diagnostic rank 


By 3.86 1.61 1.60 3.53 7.79 


B2 —0.62 1.55 —4.34 —0.23 1.48 
B3 1.00 2.45 —3.09 0.73 6.32 


Influences on variability 


òi 0.76 0.60 —0.51 0.79 1.84 
8 1.30 1.25 —0.52 1.15 4.52 
83 —1.65 1.67 —5.35 —1.46 1.19 
Cutpoints 

0, 0.35 0.21 0.02 0.32 0.81 
0 0.99 0.17 0.65 0.99 1.33 
03 1.07 0.16 0.75 1.07 1.39 
04 1.11 0.16 0.79 1.10 1.43 


By 1.88 0.44 1.18 1.82 2.88 
òi —0.26 0.41 —1.12 —0.25 0.51 
oT 0.31 0.19 0.02 0.29 0.71 
02 0.87 0.15 0.59 0.87 1.16 
03 0.94 0.14 0.68 0.94 1.23 


04 0.98 0.14 0.71 0.97 1.26 
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in a Normal example). Finite mixture regressions introduce covariates into either the 
determination of the latent class indicators or to describe the relation between the mean 
of subject 7 on each latent class m; and that subject’s attribute profile. Mixture regres- 
sions have been applied to modelling the behaviour or attitudes of human subjects 
so that each individual has their overall mean determined by their membership 
probabilities (Wedel et al., 1993). 

Thus, for univariate Normal observations y;, a p-dimensional vector of predictors X;, 
define latent indicators z; of class membership among possible classes j = 1,..J. Were 
the indicators known, 


yilzi =j ~ N(B;Xi T) (3.20) 


where B, is a class specific regression vector of length p, and 7; is the conditional 
variance. If \ = Pr(z; = j) then the overall mean for subject i is 


Aiba AnBie MiMi 


where pj; = B;X:. The indicators z; may be sampled from a multinomial without 
additional covariates, so that the multinomial has parameters \, = àj. Alternatively, 
an additional regression seeks to model the z; as functions of covariates W; such that the 
multinomial is defined by parameters 


hy = exp(b,Wi)/[1 X exp Wi] i >1 
k=2 


as in Section 3.3.1. 

Several applications of regression mixtures have been reported in consumer choice 
settings: Jones and McLachlan (1992) consider metric outcomes y, namely consumer 
preference scales for different goods, which are related to product attributes (appear- 
ance, texture, etc.). They find sub-populations of consumers differing in the weight they 
attach to each attribute. Here the multinomial logit regressions for z; might involve 
covariates W such as consumers’ age, class, or type of area of residence, while the 
modelling of the y; might involve covariates X describing the quality or price of goods. 

Binomial, ordinal or multinomial mixture regressions have utility both in represent- 
ing departures from the baseline model assumptions (e.g. overdispersion), as well as 
differential regression slopes between sub-populations (Cameron and Trivedi, 1986). 
For example, Wedel et al. (1993) argue for using a latent class mixture in an application 
involving Poisson counts y, both because of its advantage in modelling differential 
purchasing profiles among customers of a direct marketing company, and its potential 
for modelling over-dispersion in relation to the Poisson assumption. 

As mentioned in Chapter 2, there are the usual problems in Bayesian analysis (as in 
frequentist analysis) concerning the appropriate number of components. Additionally, 
Bayesian sampling estimation may face the problems of empty classes at one or more 
iterations (e.g. no subjects are classified in the second of J = 3 groups) and the switching 
of labels unless the priors are constrained. On the other hand, the introduction of 
predictors provides additional information that may improve identifiability. To counter 
label switching we might apply a constraint to one or more of the intercepts, regression 
coefficients, variance parameters, or mixture proportions that ensures a consistent 
labelling. In some situations one may be able to specify informative priors consistent 
with widely separated, but internally homogenous groups (Nobile and Green, 2000). 
These ensure (a) that different groups are widely separated, for instance a prior on 
intercepts Bo; when J = 2 might be Bo; ~ N(— 5, 2), Bor ~ N(O, 2), effectively ensuring 


112 REGRESSION MODELS 


separation, and (b) that if cases do fall in the same group, they are expected to be 
similar. 

An alternative to constrained priors involves re-analysis of the posterior MCMC 
sample, for example by random or constrained permutation sampling (Fruhwirtth- 
Schattner, 2001). Suppose unconstrained priors in model (3.20) are adopted, and 
parameter values oP = (B®, a} are sampled for the nominal group j at iteration 
t. We may investigate first whether — after accounting for the label switching 
problem — there are patterns apparent on some of the parameter estimates which 
support the presence of sub-populations in the data. Thus, if there is only p= 1 


predictor and the model is 


Vier ZNC; Tj) 


(3.21) 
Wj; = Bo Baxi 


then a prior constraint which produces an identifiable mixture might be By; > Boo, or 
Bi > Bio Or T1 > m2. (Sometimes more than one constraint may be relevant, such as 
Bor > Boo, and By, > B12). Fruhwirtth-Schattner proposes random permutations of the 
nominal groups in the posterior sample from an unconstrained prior to assess whether 
there are any parameter restrictions apparent empirically in the output that may be 
associated with sub-populations in the observations. 

From the output of an unconstrained prior run with J = 2 groups, random permuta- 
tion of the original sample labels means that the parameters nominally labelled as 1 at 
iteration ¢ are relabelled as 2 with probability 0.5, and if this particular relabelling 
occurs then the parameters at iteration ¢ originally labelled as 2 are relabelled as 1. 
Otherwise, the original labelling holds. If J =3 then the nominal group samples 
ordered {1, 2, 3} keep the same label with probability 1/6, change to {1, 3, 2} with 
probability 1/6, etc. 

Let 6, then denote the relabelled group j samples for parameters k = 1,.., K. (A 
suffix for iteration ¢ is understood.) The parameters relabelled as 1 (or any other single 
label among the j = 1,..J) provide a complete exploration of the unconstrained par- 
ameter space, and one may consider scatter plots involving ik against 81m for all pairs 
k and m. If some or all the plots involving 61, show separated clusters, then an 
identifying constraint may be based on that parameter. To assess whether this is 
an effective constraint, the permutation method is applied based not on random 
reassignment, but on the basis of reassignment to ensure the constraint is satisfied at 
all iterations. 


Example 3.10 Viral infections in potato plants Turner (2000) considers experiments in 
which viral infections in potato plants were assessed in relation to total aphid exposure 
counts. The experiment was repeated 51 times. The data are in principle binomial, 
recording numbers of infected plants in a 9 x 9 grid with a single plant at each point. 
However, for reasons of transparency, a Normal approximation involving linear regres- 
sion was taken. The outcome is then just the totals of plants infected y, which vary from 
0 to 24. A plot of the infected plant count against the number of aphids released (x) 
shows a clear bifurcation, with one set of (y, x) pairs illustrating a positive impact of 
aphid count on infections, while another set of pairs shows no relation. 


Here we consider two issues: the question of possible relabelling in the MCMC 
sample, so that samples for the nominal group 1 (say) in fact are a mix of parameters 
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from more than one of the underlying groups. Secondly, we consider possible assess- 
ments of the number of groups. 

A two group mixture, with means and conditional variances differing by group, was 
accordingly well identified by Turner and his bootstrap analysis showed marked 
improvement in fit over a single group model. As discussed in Chapter 2, the bootstrap 
is often applied when the usual likelihood ratio tests do not follow standard asymptotic 
densities. Suppose one wished to assess the gain in fitina J 1 vs. J component model, 
with Q° denoting the observed likelihood ratio comparing (say) a three group to two 
group model, with parameters ĝo) and ĝo, respectively. Then one might sample S 
repeated data sets from the parameters 0o) and derive the likelihood ratio statistic Q, 
for each, and find the proportion of sets where Q° exceeds Q,. This is the bootstrap test 
criterion, as opposed to the nominal test based on asymptotic results. 

First we run the model (3.21) for J = 2 and K = 3 (intercepts Boj, slopes Bij, and 
variances T;). Using the output from 20000 iterations with a single chain, 500 iteration 
burn-in, and starting from null starting values, we apply (e.g. in a spreadsheet) the 
random permutation sampler of Fruhwirtth-Schattner (2001). The relabelled group 
1 parameters (every tenth sample) are then plotted against each other (Figures 3.3a—c). 
Both plots involving the slope, namely Figures 3.3a and 3.3b, show well separated 
clusters of points, suggesting an identifiability constraint on the Bj;, such as Bj; > By. 
Applying the constrained permutation sampler, again to the output from the uncon- 
strained prior, shows that this is an effective way of identifying sub-populations, and 
one can either use the output from the constrained permutation to derive parameter 
estimates or formally apply the constraint in a new MCMC run. From the constrained 
permutation sampler, the following characteristics of the two groups are obtained: 


Bor Boo Bi Bis 1/7 1/72 
Mean 0.868 3.324 0.002 0.055 0.900 0.096 
2.5% 0.115 1.047 —0.003 0.040 0.361 0.043 


97.5% 1.733 5.593 0.007 0.069 2.117 0.170 


Slope 


=0-02 
Intercept 


Figure 3.3a Plot of intercept vs. slope for relabelled first group iterations 
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Figure 3.3c Intercept against precision for relabelled group 1 iterations 


In the present application, it was not possible to identify a stable solution with three 
groups that provided an improvement in log-likelihood over the two group solution. 
This was apparent both from MCMC sampling with a constraint on the slopes B,; and 
using the permutation sampler on the output from a three group model without a prior 
constraint. Plots of the intercepts on the slopes from the relabelled parameter iterates 
Bick = 1, 3) showed no distinct groups for J = 3, but instead just ‘scatter without 
form’. 
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It remains to demonstrate the gain in a two over a one group model. This might be 
possible to demonstrate using penalised fit measures such as the AIC statistics. Thus, 
parallel sampling of a one and two group model (the latter with constrained prior) could 
be carried out, and the proportion of iterations assessed where the AIC for the two 
group model improved over that for the one group model. Note that the number of 
parameters are usually taken as p; = 3 and pz = 7 in the one and two group models, 
respectively. Other options (see Raftery, 1995) might involve marginal likelihood 
approximations leading to approximate Bayes factors. Another possibility is Bayes 
sampling analogues of bootstrap and Monte Carlo test procedures (see Dey et al., 1998). 

Here a marginal likelihood approximation is obtained by monitoring the inverse 
likelihood g® = 1/f(y;:|0®) for each case. The inverse of the posterior mean of g\ is 
an estimate of the CPO for case i, and the sum of the logarithms of these estimated 
CPO; is an estimate of the log marginal likelihood of the model. Here this procedure 
gives log marginal likelihood estimates of —162.9 and —140.5 under the one and two 
group models, and a clear preference for the two group model. 


3.5 GENERAL ADDITIVE MODELS FOR NONLINEAR REGRESSION 
EFFECTS 


A generalisation of the smoothing prior approach of Chapter 2 is to generalised additive 
models in regression. Such models provide an approach to modelling possible 
nonlinearity, but avoiding the need to specify complex algebraic forms. Thus for a 
metric outcome y1, ..Yn, assume there are corresponding values of a regressor variate 
X1,...X, ordered such that 


Xp XQ <1. Xy 


The model for the observations may then be 


Yye= Bo fx) & 


where s; ~ N(0, o°). Let g, = f (x:) be the smooth function representing the changing, 
possibly nonlinear, impact of x on y as it varies over its range. As in Chapter 2, it is 
common to assume Normal or Student random walks in the first, second or higher 
differences of the g;. A variant on this is when the smooth in the variable x modifies the 
effect of a predictor z, with 


Vi=Bo Uf) & 


If x, = t denotes time and y; is a time sequenced response, then a dynamic coefficient or 
state space model is obtained, with 


Ye = Bo 2B & 


It will commonly be the case that the x, are unequally spaced, and it is then necessary in 
specifying the prior for g; (or B,,) to weight each preceding point differently. This means 
adjusting the precision such that wider spaced points are less tied to their predecessor 
than closer spaced points. Thus, suppose the x; were irregularly spaced and that the 
spaces between points are 6) = x2 — x1, 82 = X3 — X0,... 8)-1 = Xn — Xn-1. A first 
order random walk smoothness prior, with Normal errors, would then be specified as 


gi ~ N(g-1, 8,7”) 
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and a second order one would be 
ge ~ NY, 8,7”) 


where v; = g,-1(1 8, /8)-1) — g+-2(8,/8;-1) (see Fahrmeir and Lang, 2001). If there is 
equal spacing then the first and second order random walk priors are just 


gi ~ N(gi-1, T’) 
gi ~ N81 — 81-2, T) 


With smooths on several regressors x1;, X2;, X3;.-1t will be necessary to supply an 
ordering index on each one at observation level O17, Oz, O3;, etc. A frequent situation 
is the semi-parametric additive form, with smooths on a subset of p regression variables, 
with the remainder modelled conventionally. Quite often there would be just a single 
regressor with a general additive form and the remainder included in a conventional 
linear combination. It may be noted that results may be sensitive to the priors assumed 
for the initial smoothing values (e.g. g; in a first order random walk) and for the 
evolution variance 7”. This is especially so for sparse data, such as for binary outcomes 
y,- In BUGS convergence may benefit from sampling 77 directly from its full conditional 
density. 

Other smoothness priors have been proposed. For example, Carter and Kohn (1994) 
consider the signal plus noise model 


Yt5 gt & 
where the g, are generated by a differential equation 

dg be dW, 

de" dt 


with W, a Weiner process, and 7 the smoothing parameter. This leads to a bivariate 
state vector s; = (gt, dks) in which 


Sı = Fis) uy (3.22) 


18 
r= (9 4) 


Also, the u; are bivariate Normal with mean zero and covariance t? U;, where 


y, = {%/3 8/2 
o (8/2 ò; 


where, with 6, defined as above, 


Wood and Kohn (1998) discuss the application of this prior with binary outcomes, 
which involves applying the latent variable model of Albert and Chib (1993). 


Example 3.11 Kyphosis and age Consider the kyphosis data from Hastie and Tib- 
shirani (1990) on 81 patients receiving spinal surgery (Appendix Table 2). The binary 
outcome y; relates to the post-surgical presence or otherwise of forward flexion of the 
spine from the vertical. As mentioned above, this is a sparse data form and so results of 
additive smoothing may be sensitive to priors assumed. 
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Risk factors taken to influence the outcome are xı; =the number of vertebrae level 
involved, x2, =the starting vertebrae level of the surgery and x3, =age in months. 
Initially, a probit model with linear effects in X; = (x1;, X2;, x31) is applied using the 
latent dependent variable method. Thus 


Zi ~ Nu, 1) Z(0, ) if y= 1 
z~ N, DIGO  ify=0 


with mean 


W; = Bo Bixi BoX2 Bax 


N(0, 10) priors are set on Bo, Bı, and By, but an N(0, 1) prior on B; as it is applied to 
ages which exceed 200, and too vague a prior may lead to numeric problems as a large 
coefficient is applied to a large age value. This analysis shows clear positive effects for xı 
and a negative effect for x2, but a less clear positive (linear) effect of age. 

To clarify possible nonlinear effects, the impact of age is instead modelled via a 
general additive form. Thus, sampling of the z; is defined as above, but with mean now 


M = Bo Pixi Borax Baxs f xX3:) 


In defining the smoothness prior for f(), the age variable is grouped according to the 
Ng = 64 distinct values, as in Appendix Table 2, which includes both the original age and 
the age group index. The differential equation prior (3.22) is used, sampling o? = 1/7” 
from the full conditional specified by Carter and Kohn (1994, p. 546). Prior values of 
0.001 for are used for index and shape parameters in the gamma full conditional density 
for o?. Using the initial parameter values as in Program 3.11 convergence in 7? is 
apparent after 5000 iterations, and the posterior summary is based on iterations 
5000-50 000. 

The evidence from the smooth (Figure 3.4) is of a clear nonlinear effect in age, 
reaching a maximum at ages between 90 and 100 months. The linear age effect is 
eliminated, but the values of the regression coefficients on the other covariates are 
somewhat enhanced in the general additive model. 


ao MMe 


50 100 150 ° 200 250 


Age 


Figure 3.4 Smooth for age 
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Table 3.14 Kyphosis outcome, standard Probit and general additive models 


Probit regression Mean St. devn. 2.5% Median 97.5% 
Deviance 64.9 27 61.6 64.3 71.8 
Intercept —1.130 0.744 —2.615 —1.102 0.277 
Number 0.229 0.110 0.020 0.227 0.448 
Start —0.125 0.038 —0.202 -—0.125 —0.053 
Age 0.006 0.004 —0.001 0.006 0.013 


Additive form on age 


Deviance 59.4 4.1 52.1 58.8 70.8 
Intercept —1.337 1.206 —4.301 —1.295 1.270 
Number 0.249 0.127 —0.033 0.247 0.553 
Start —0.147 0.045 —0.255  —0.146 —0.046 
Age (Linear) 0.004 0.013 —0.024 0.004 0.033 
T? 0.0013 0.0011 0.0002 0.0009 0.0057 


The deviance of the GAM model improves only slightly on the linear model, despite 
being more heavily parameterised; the penalised fit (as may be verified) therefore 
deteriorates. The plot of the smooth suggests a low order polynomial might be sufficient 
to model the nonlinearity, and this would be less heavily parameterised than a GAM. 


3.6 ROBUST REGRESSION METHODS 


Statistical modelling based on the Normal distribution is often the default option in 
regression with metric outcomes, or in models including hierarchical random effects, 
when outcomes are discrete. One approach seeks to identify outliers in terms of poster- 
ior probabilities that an absolute value of an estimated residual is surprisingly large (see 
Chaloner, 1994 and Example 3.12). However, instead of adopting Normality and then 
seeking possible outlier observations inconsistent with Normality, an alternative is to 
embed the Normal in a broader model set. This larger set may be defined by an extra 
parameter (or parameters) that afford resistance or robustness to non-Normality. 
Alternatively, the Normal can be set within a range of alternative densities, allowing 
heavier or (less commonly) lighter tails than the Normal. Thus Bayesian approaches to 
robustness in metric outcomes include the ¢ density, and contaminated Normal densities 
that explicitly allow some observations to be outliers (Verdinelli and Wasserman, 1991). 

Under the Student ¢ density, resistance to outliers is accommodated by varying the 
degrees of freedom parameter. As considered in Chapter 2, introducing this extra 
parameter is in fact still equivalent to the Normal but with a variable scale parameter 
for each observation. The associated weight that may be used to indicate outlier status 
in relation to the overall assumed density or regression model. 

Suppose the data consists of univariate metric outcomes y1, i= 1,.., n and ann x p 
matrix of predictors. Then consider a Student f regression model for the means p; = Bx; 
with variance o? and known degrees of freedom v. Assuming the reference prior (Gel- 
man et al., 1995) 


TB, o°) xo! 
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the posterior density is proportional to 
oe YTTR i- Bx ory? 2? 
i=l 


Similarly, if the outcome y is multivariate Student t of dimension q with dispersion $, 
and 7(B, >) x |S|~', the posterior is proportional to 


a D) K 1 di -0 4)/2 
| [| 5 (91 = Bx! — Bx) 
i=l 


where È is a q x q dispersion matrix. 

The equivalent scale mixture specification in either case involves unknown weight 
parameters w; that scale the overall variance or dispersion parameter(s) of the Normal. 
Thus, for a univariate outcome, the Student t model may be expressed as 


yi=Bxu € 
e; ~ N(O, o7/w,) (3.23) 
w; ~ G(v/2, v/2) 


The multivariate version of this takes again w; as G(v/2, v/2), and takes the ith vector 
observation y; to be sampled from a multivariate Normal with dispersion matrix 


>i = >/0; 
Suspect observations (i.e. potential outliers) with small weights œw; and large distances 
yi — Bx; from the regression model have effects 
(i = Bx); i = Bx) 
on the posterior density down-weighted, with the degree of down-weighting usually 
being enhanced for smaller values of v. 

Other densities with heavier (or possibly lighter) tails than the Normal are obtained 
with alternative densities for the w; or other types of mixing of densities. For instance, 
Smith (1981) considers a model for calculating marginal likelihoods, which involves 
choosing between the Normal, Double Exponential or Uniform densities, with prior 


probabilities of 1/3 on each density. The M-estimators of Huber (1981), which form the 
base for much work on robust estimation, are based on densities 


P(y|p,o,k) œ exp[— U(d)] 
where d = (y — p)/o, and 

U(d)=0.5d if |d|< x 

U(d) =x|d|—0.5«? if |d|/>« 


As « tends to infinity, this form tends to the Normal while for k near zero it approxi- 
mates the double exponential. 


3.6.1 Binary selection models for robustness 


A variant of the scale mixture approach for metric dependent variables takes w as 
binary with probability A = Pr(w = 1) of selecting a Normal density with mean p and 
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standard deviation o. On the other hand, if w = 0, then an overdispersed alternative is 
selected with the same mean but standard deviation ka, where k considerably exceeds 1. 
For example, taking \ to be small, e.g. A ~ U(0, 0.1) and k ~ U(2, 3) allows protection 
against a low level of contamination (of up to 10% of the observations) and variance 
inflation in that contaminated component of between four and nine times the overall 
level. Setting \ to a very low level, e.g. \ = 0.01, and x positive with unrestricted ceiling, 
allows for a small number of extreme outliers. 

Outlier resistant models may be seen as allowing for measurement error or (for 
categorical outcomes) misclassification. Thus in regression for binary outcomes, an 
outlier in the response y can be seen as possibly due to a transposition 0 — 1 or 
1 — 0. These two cases occur when (a) the observed outcome y; = 1 despite a model 
probability for y = 1 being close to zero, i.e. 


Pr( y; = lx) = 7; © 0 


and (b) when y; = 0 despite m; being close to 1. If residuals are defined as n; = yi — Ti, 
then an outlier is indicated if the absolute residual is close to 1. The sensitivity of a 
binary regression to outliers in part depends on the assumed link (e.g. the logit and 
complementary log-log links have heavier tails than the probit). 

Regardless of link, a model for resistant binary regression may also be specified, 
including a mechanism for transposition between 0 and 1. Let y; be the recorded 
response, and also define 7; = Pr(j; = 1|x;), where J; is the true response. Following 
Copas (1988), assume a transposition'! occurs with a small probability y such that the 
probability of the actually recorded response being 1 is given by 


Tmi = Pr( y; = 1|xi) 
= Pr(y; = 1|9; = PrO: = O|x;) Pr(yi = 1]; = DPrO; = 1x) 
=(1— y); yU- ñi) (3.24) 
where (for example) 
logit(#;) = Bx; 
The likelihood ratio that y is an outlier (or more particularly, misrecorded as a result of 


transcription) as against it belonging to the data (or being a genuine observation) is then 
given by 


R=yi(l—7)/7a; (1 yii — T) (3.25) 


3.6.2 Diagnostics for discordant observations 


The search for robust or resistant fits in general linear models extends to consider 
outlying points in the design space (of the X variables) as well as outlying responses (y). 
Logistic models, binary or multiple, may be especially sensitive to such outliers. 
Adjusting for such outliers may help to avoid unjustified rejection of models in their 
entirety, or of particular explanatory variates within models, because of distortions due 
to a few unusual data points. 

11 In BUGs this involves the coding, for a single predictor x[i]: 

yli] ~ dbern(pstar[i]) 


pstar[i] <- (1-gamma)*pļi] gamma*(1-p[i]) 
logit(p[i]) <- beta[1] xļi]*beta[2] 
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Methods for outlier detection or for down-weighting influential observations may be 
based on appropriately defined residual terms (Copas, 1988). For example, for a logit 
regression for binary outcomes with probability 7;, the components of deviance are 


di =—2log(1—7;) when y; =0 
— 2log(a;) when y; = 1 


and taking ê; = y; — E(y;|x;) = yi — 7; in line with a general definition of residuals gives 
di = —2log (1 — |g;|) (3.26) 


One may modify the usual likelihoods or deviances to downweight influential observa- 
tions, via influence functions g(u) which penalise large values of u (Pregibon, 1982). 
Whereas maximum likelihood estimation is equivalently minimisation of };d;, where d; 
is the deviance component of the ith case, robust estimation instead minimises Ł;g(d;), 
where g(d;) < d; for large d;, so lessening the influence of cases with large deviances. The 
g(d;) may be based on the estimated residuals, as in Equation (3.26). 

Another approach to obtaining residuals from a binary regression involves the latent 
variable method of Albert and Chib (1993), which is equivalent either to probit or logit 
regression: for example, if cases have different and known weights under the probit 
option, then a homogenous variance of 1 may be replaced by variances (inverse of the 
weights) averaging 1. 

One may also consider the effect of each observation on the fitted model; for example, 
regression coefficients may be sensitive to particular points with unusual configurations 
of design variables, xj, x2, .. Xip. Thus estimates of a coefficient B when all cases are 
included may be compared with the same coefficient estimate B[7] when case 7 is 
excluded (Weiss, 1994; Geisser, 1990). The differences 


AB; = BI] — B 
may then be plotted in order of the observations. A cross-validatory approach to model 
assessment omitting a single case at a time therefore has the advantage not just of 
providing a pseudo marginal likelihood and pseudo Bayes factor, but of providing a 
measure of the sensitivity of the regression coefficients to exclusion of certain observa- 


tions. One may obtain posterior summaries of the AB;, ascertain which are most clearly 
negative or positive, and so produce the most distortion of the all cases estimate B. 


Example 3.12 Group means from contaminated sampling Chaloner (1994) and Chal- 
oner and Brant (1988) discusses the identification of outliers at two levels within a one 
way analysis of variance problem. Specifically, Chaloner (1994) considers an outlier as 
an observation with residual e; exceeding a threshold k appropriately defined. For 
Normal data with n observations, and setting 


k= 0110.5  0.5(0.95!/")} 


ensures that the prior probability of no outliers is 0.95. Chaloner then analyses data 
from Sharples (1990) using both inflated variance and mean shift contaminated mixture 
models for outliers (see Table 3.15 for the data concerned). For a one-way data set with 
I = 5 groups (and n; = 6 observations within the ith group), within group errors were 
sampled with probability 0.1 from a gamma distribution with mean 5.5. In addition the 
group means used to generate the data were 25 for groups 1—4, but 50 for group 5. 
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Table 3.15 Data generated with contamination process (Sharples, 1990) (Outliers starred) 


Observations in Group 


Group 1 2 3 4 5 6 Average 
1 24.80 26.90 26.65 30.93 33.77 63.31* 28.61 
2 23.96 28.92 28.19 26.16 21.34 29.46 26.34 
3 18.30 23.67 14.47 24.45 24.89 28.95 22.46 
4 51.42* 27.97 24.76 26.67 17.58 24.29 24.25 
5 34.12 46.87 58.59* 38.11 47.59 44.67 42.27 


Three of the 30 observations are identified as a priori outliers (known to be sampled 
from the contamination gamma density). The question is then to identify the probability 
of outliers among the data y,(i=1,.., nj; j = 1,..,J) and among the group means. 
Specifically, suppose 


Yy ~ NO, T7 
0; ~ N(w, T5) 
Then the first and second stage residuals are defined as 


Ej = (Yy — 0;)/Tw 


and 
g; = (0; — w)/t 


and compared to k above. If Pr(e; > xK) or Pr(e; > x) exceeds the prior probability of 
0.05, then an outlier is indicated. Prior specification of 77, and 7; is important, as certain 
priors may allow excessive shrinkage (7; too small). Note that the observed between 
group variance is around 60. 

Consider first a prior (prior A) for the overall variance 1? = 72, 7; and then the 
proportion assigned to within group variation decided by a ratio m with a flat beta prior. 
Prior B is used by Chaloner, namely 

Pr(72 7) x T In 


w? w 


which in BUGS involves a double grid prior scaled to ensure total mass of one. The grid 
takes account of the actual value of the between group variance to the extent that small 
overall variances (below 5) are excluded. Finally, prior C is a double grid prior with 
equal probability over the pairs of values of 7?, and 72 in the grid. 

Whatever the prior, it is clear that the likelihood for 7; is relatively flat, but prior C 
results in a higher mean estimate for between group variance than the other two. 
Despite this, probabilities of individual observation outliers are relatively similar 
regardless of the prior adopted as are the estimates of the true group means 01, 02,.., 05. 
The probabilities that yı,6 and y4,; are outliers are, respectively, {0.38, 0.028), 
{0.42, 0.037} and {0.35, 0.031} under the three priors, whereas Chaloner, who uses a 
Normal approximation in conjunction with a Laplace approximation, finds values of 
0.44 and 0.05. Priors A and B give a posterior probability of group 5 being an outlier of 
around 0.05 to 0.06; this compares to 0.019 cited by Chaloner. 
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Example 3.13 Stack loss To illustrate the effect of robust alternatives to the Normal 
for metric outcomes based on the Student ¢ density, consider n = 21 points in the classic 
data set for stack loss, y and predictors x; = air flow, x2 = temperature and x3 = acid. 
A simple Normal errors model gives an estimated equation, with posterior means and 
standard deviations 


y=—43.6 0.72x, 1.28x— 0.11.x3 
(10.5) (0.12) (0.32) (0.13) 


Lange et al. (1989) show a small improvement in likelihood in moving from a Normal 
(degrees of freedom v = oo) to a Cauchy density (v = 0.5), with the maximum likeli- 
hood estimate of the degrees of freedom provided by v = 1.1. Here an exponential prior 
for v is assumed with mean n, which is itself assigned a uniform prior’? between 0.01 
and 1. It may be noted that a proper prior is needed on v to avoid relapsing to the 
Normal (see Geweke, 1993, p. S27). 

Adopting this approach (Model B in Program 3.13) gives a median value of around 
1.7 for v, with the credible interval ranging from 0.5—60. The estimated equation is now 


y=-39.4 0.81x,  0.71x2 — 0.093 
(7.6) (0.12) (0.37) (0.11) 


so that the coefficient on x2 is considerably reduced. 

We then take the scale mixture approach to the Student ¢ but with v known (as 1.7). 
This shows the lowest weights (w; as in (3.23)), namely 0.78 and 0.67, for observations 4 
and 21. Adopting the MLE value of 1.1 instead gives the lowest weights (all under 0.25) 
on observations 3, 4 and 21 (cf. Lange et al., 1989, p. 883). 

Finally Model D in'* Program 3.13 also uses a scale mixture approach, but with v 
also unknown. The outcome of this model, from a run of 10000 iterations, is a median 
for v of 1.3, and weights of 0.14 and 0.10 on observations 4 and 21. The coefficient on x2 
is further reduced to average 0.61 (compare Lange et al., 1989, Table 1). 


Example 3.14 Leukaemia survival To illustrate the transposition model (3.24) for a 
binary outcome, we consider the leukaemia data of Fiegl and Zelen (1965), but with 
response for n = 33 subjects being defined according to whether they survived for a year 
or more (y = 1 for deaths at more than a year). The two covariates are x; = white blood 
cell count and x2 = positive or negative AG (presence or absence of certain morphologic 
characteristics in the white cells). A standard logistic regression shows both covariates 
negatively related to extended survival (Table 3.20), with the WBC coefficient being 
—0.04. 


In adopting the alternative model allowing for possible contamination, it is to be noted 
that Copas (1988, p. 245) found by maximum likelihood methods that a value of y = 0.003 
in (3.21) leads to a large increase in the absolute size of the WBC coefficient. This means 
the predicted longer term survival chances of patients with low WBC counts are 


12 This is broadly equivalent to assuming the degrees of freedom is between 1 and 100. 
13 Run in version 1.2. 
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increased. However, further increases in y had a relatively small impact. Therefore, a 
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Table 3.20 Leukaemia data, logit model and logit under transposition, 


parameter summary 


Mean St. devn. 2.5% Median 97.5% 
Standard Model 
Intercept 1.073 0.708 —0.231 1.049 2.538 
WBC —0.041 0.022 —0.092 -—0.038 —0.005 
Negative AG —2.49 1.036 —4.688 —2.432 —0.601 
Transposition Model 
Intercept 3.68 2.04 0.32 3.39 11.07 
WBC —0.32 0.18 —0.89 —0.30 —0.02 
Negative AG —3.38 1.89 —10.47 —3.12 —0.30 
y 0.0066 0.0049 0.0003 0.0055 0.0228 
Likelihood Ratio (Misrecording vs. Genuine) (see equation (3.22) ) 

Mean Median Mean Median 
D; 0.1268 0.0689 Dis 53.7300 2.6940 
D» 0.0969 0.0433 Dio 48.4700 1.7380 
D3 0.1906 0.1263 D 0.6457 0.4211 
D; 0.1342 0.0753 Dy; 1.8150 0.9181 
D; 18.82 4.73 Dz 0.1486 0.0917 
De 1.2400 0.7816 Dz 0.4155 0.2796 
D; 1.0180 0.6772 Dz 0.1177 0.0680 
Ds 0.5933 0.1937 D25 0.0260 0.0047 
Do —39.61 5.65 Dzs 0.0120 0.0005 
Dio 0.3770 0.2835 Dz 0.0111 0.0004 
Di; 0.8145 0.5736 Dys 0.0091 0.0002 
Di2 0.0970 0.0022 Dz 0.0130 0.0007 
Di3 0.0809 0.0010 D30 0.0206 0.0026 
Di4 0.0114 0.0001 D3; 0.0017 0.0001 
Djs 0.0114 0.0001 D32 0.0011 0.0001 
Dio 0.0383 0.0001 D33 0.0011 0.0001 
Di7 8966 9999 


prior on y may be set drawing on this analysis, and specifically 


A three chain run of 25000 iterations shows convergence from around 5000 iter- 
ations, and the summary is based on iterations 5000-25 000. The posterior mean for the 
WBC coefficient now stands at —0.34, with a 95% interval confined to negative values. 
The posterior mean for y itself is just under 0.007. The likelihood ratios (3.25) are 
highest (hence chance of outlier status greatest) for observation 17, which has y = 1 (the 
patient survived 65 weeks) but a WBC measure of 100. The coefficient on the dummy 
index for AG status is also increased in absolute size, though like the WBC coefficient is 
estimated less precisely (i.e. the posterior standard deviation is increased over the 


standard logit). 


y ~ E(300) 
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Example 3.15 Travel for shopping To illustrate alternative approaches to robustness 
with binary outcomes to the contamination/misclassification model, consider data 
from Guy et al. (1983) for a panel survey of shopping behaviour. This involved 
84 family households in suburban Cardiff, approximately equidistant from the 
city centre, with the response y; being whether or not the household used a city centre 
store during a particular week. The predictors are income (Inc), household size 
(Hsz, for number of children) and whether the wife was working (WW). The first 
two covariates are respectively ordinal (with levels 1 for income under £1000 up to 
8 for incomes over £15000) and discrete — but both are taken as continuous. 


Wrigley and Dunn (1986) consider issues of resistant and robust logit regression 
against a substantive background, whereby positive effects of income and working 
wife on central city shopping are expected, but a negative effect of household size. 
They argue that it is preferable to work with all the explanatory variables considered 
relevant and that exclusion of predictors because of one or two outliers or influential 
observations should be avoided. Let 


Tm; = Pr(yi = 1) 


Wrigley and Dunn cite estimates from a maximum likelihood fit as follows (with 
standard errors in brackets) 


logit(m;) = —0.72 0.14Inc — 0.56Hsz 0.83 WW (3.27) 
(0.91) (0.23) (0.19) (0.54) 


So the significance of the working wife variable is only marginal (i.e. income is signifi- 
cant at 5% only if a one-tail test is used). 

Here we adopt mildly informative priors (cf. Chib, 1995) in a logit link model: 
N(0.75, 25) priors on Bme and Bww are taken in line with an expected positive 
effect on central city shopping of income and female labour activity, while a 
N(—0.75, 25) prior on By,, reflects the expected negative impact of household size. 
From a 10000 iteration three chain run a stronger effect of income is obtained than 
in Equation (3.27). The analogous equation to that above, with posterior standard 
deviations in brackets, is 


logit(m;) = —0.56 0.39Inc — 0.61 Hsz 1.07 WW 
(0.97) (0.25) (0.20) (0.59) 


The 90% credible intervals on both the income and working wife variables are entirely 
confined to positive values, though this is not true for the 95% intervals. The highest 
deviance components are obtained for observations 5, 55, 58, 71 and 83. These points 
account for about 20% of the total deviance (minus twice the log likelihood, which 
averages about —50). The highest deviance is for case 55. The conditional predictive 
ordinates are lowest for cases 55 and 71. 

Similarly, case 55 is the highest average residual under the latent utility logit model of 
Albert and Chib (1993) — see Model (B) in Program 3.15. The logit is matched by a 
latent utility following Student ¢ errors with 8 degrees of freedom (logistic errors are 
‘heavy tailed’ like the Student ¢ density). Trying Student ¢ sampling with smaller degrees 
of freedom (e.g. t(2) errors) makes little difference to any conclusions about credible 
intervals for income and working wife coefficients — the 95% intervals still straddle zero. 
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Note that either the Normal or Student ¢ latent utility approach of Albert and Chib 
allow different known weights for each point. This approach might therefore be used to 
downweight certain observations using an influence function based on deviance or 
leverage contributions (see the Exercises). 

As an illustration of potential distortion from particular data points Model C in 
Program 3.15 applies the full cross-validation methodology based on single case omis- 
sion. The differences between Bimo and Brey for income (Del.beta.Inc[] in Model C) 
show that major changes in this coefficient are caused by exclusion of particular points. 
The average value of Bine from the standard logit link model is 0.39 with posterior 
standard deviation of 0.25. Exclusion of case 71 raises this coefficient by over half this 
standard deviation, to around 0.58, while exclusion of case 58 raises it to around 0.47. 
By contrast excluding case 29 lowers the coefficient to around 0.26. There might 
therefore be grounds for excluding case 71 at least, as it figures as an outlier and is 
influential on the regression. As discussed above, other options to assess robustness of 
inferences may be used, which retain the suspect case(s), but model them via contamin- 
ated priors or discrete mixture regressions. 


3.7 REVIEW 


The Bayesian approach to linear regression with continuous outcomes, general linear 
modelling (including discrete outcomes), and nonlinear regression offers several advan- 
tages. These include the facility with which repeated sampling estimation may be used to 
select the most appropriate subsets of regressors (and hence best regression model) in 
situations where there is multicollinearity. Recent developments include modifications 
of the univariate selection procedures (such as in Kuo and Mallick, 1998) to allow 
multivariate responses, or to vary hyperparameters so that more or less parsimonious 
models are chosen (Brown et al., 1998). The facility with which varying levels of 
prior information regarding regression slopes is another advantage. Prior information, 
such as from historical data or previous studies of similar data, may allow one to set 
a prior expected value of B based on past information (Birkes and Dodge, 1993), or 
to assign some weight to prior data Dp relative to that assigned to likelihood of the data 
D of the current study (Chen et al., 2000). A range of non-parametric regression 
methods — though some are possible to fit without a full Bayesian methodology — allows 
flexible regression modelling where effects are non-constant and this may be combined 
with predictor selection (Smith and Kohn, 1996). Both non-parametric regression 
and other approaches to regression (e.g. determining outliers via scale mixture 
Student ¢ regression) allow for a range of sensitivity checks to standard modelling 
assumptions of linear effects and constant variances. On the other hand regression 
in certain settings, e.g. discrete mixture regression, is quite complex under 
a Bayesian approach with repeated sampling estimation leading to problems of 
identifiability. Issues such as prior specification and related questions of identifi- 
cation, model choice and robust inference, recur in the following chapters which 
consider more specialised data structures (multi-level data, time series data and panel 
data). 
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Table 1 Ultrasonography ratings 


Diagnostic ratings Predictors 
Tumor Site Status (Metastasis) R1 R2 R3 R4 R5 XI X2 X3 
Colon N 27 17 2 0 0 0 0 0 
Colon Y 4 1 2 2 13 1 0 0 
Breast N 6 5 2 1 0 0 1 0 
Breast Y 0 2 0 2 5 1 1 1 
Table 2 Kyphosis data 
Sequence No Kyphosis Age Number Start Age Group 
1 absent 71 3 5 24 
2 absent 158 3 14 57 
3 present 128 4 5 47 
4 absent 2 5 1 2 
5 absent 1 4 15 1 
6 absent 1 2 16 1 
7 absent 61 2 17 22 
8 absent 37 3 16 17 
9 absent 113 2 16 40 
10 present 59 6 12 21 
11 present 82 5 14 30 
12 absent 148 3 16 54 
13 absent 18 5 2 9 
14 absent 1 4 12 1 
16 absent 168 3 18 59 
17 absent 1 3 16 1 
18 absent 78 6 15 27 
19 absent 175 5 13 60 
20 absent 80 5 16 28 
21 absent 27 4 9 13 
22 absent 22 2 16 11 
23 present 105 6 5 38 
24 present 96 3 12 34 
25 absent 131 2 3 49 
26 present 15 7 2 7 
27 absent 9 5 13 5 
29 absent 8 3 6 4 
30 absent 100 3 14 36 
31 absent 4 3 16 3 
32 absent 151 2 16 55 
33 absent 31 3 16 14 
34 absent 125 2 11 45 
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Table 2 (continued) 
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Sequence No Kyphosis Age Number Start Age Group 
35 absent 130 5 13 48 
36 absent 112 3 16 39 
37 absent 140 5 11 52 
38 absent 93 3 16 33 
39 absent 1 3 9 1 
40 present 52 5 6 20 
41 absent 20 6 9 10 
42 present 91 5 12 32 
43 present 73 5 1 26 
44 absent 35 3 13 15 
45 absent 143 9 3 53 
46 absent 61 4 1 22 
47 absent 97 3 16 35 
48 present 139 3 10 51 
49 absent 136 4 15 50 
50 absent 131 5 13 49 
51 present 121 3 3 44 
52 absent 177 2 14 6l 
53 absent 68 5 10 23 
54 absent 9 2 17 5 
55 present 139 10 6 51 
56 absent 2 2 17 2 
57 absent 140 4 15 52 
58 absent 72 5 15 25 
59 absent 2 3 13 2 
60 present 120 5 8 43 
6l absent 51 7 9 19 
62 absent 102 3 13 37 
63 present 130 4 1 48 
64 present 114 7 8 41 
65 absent 81 4 1 29 
66 absent 118 3 16 42 
67 absent 118 4 16 42 
68 absent 17 4 10 8 
69 absent 195 2 17 63 
70 absent 159 4 13 58 
71 absent 18 4 11 9 
72 absent 15 5 16 A 
73 absent 158 5 14 57 
74 absent 127 4 12 46 
75 absent 87 4 16 31 
76 absent 206 4 10 64 
77 absent 11 3 15 6 
78 absent 178 4 15 62 
79 present 157 3 13 56 
80 absent 26 7 13 12 
81 absent 120 2 13 43 
82 present 42 7 6 18 
83 absent 36 4 13 16 
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EXERCISES 


1. For the ship damage data of Example 3.3, try allowing for extravariability by adding 
a random effect in the log link for the regression mean. How do regressor effects then 
compare with the Poisson model? How might robustness to outliers be gained under 
this approach? Try assessing gain in fit by the effective parameter method using the 
Poisson deviance (Chapter 2). 


2. Also in Example 3.3, assess the expanded variance function (Model C) against the 
regression variance function (Model D) via the effective parameters method and 
predictive loss methods. Under the predictive loss approach try alternate values of k 
such as k = 5, k = 10, and assess stability of inferences on model choice. 


3. Following Example 3.5 compare models for schizophrenia cases in small areas 
involving (a) the deprivation index as sole predictor and (b) the deprivation 
and anomie indices. Evaluate model choice using classical fit measures (i.e. 
deviances), the DIC method, and pseudo marginal likelihood based on single case 
omission. 


4. In Example 3.6 add code for the quantities sı and s2 comparing risks of infection when 
additional risks are present, and assess the probability that sı exceeds s2. 


5. Consider the influence discrepancy functions in Example 3.7 if the four cases 5, 59, 61 
and 80 are excluded simultaneously. 


6. Suggest how the binomial link mixture model (Section 3.3.4) could be adapted to 
modelling an unknown extra-variation parameter k in the Poisson link as in Equa- 
tion (3.5), with —1 < x < 1. Try this approach with the ship damage data of Example 
3.3. 


7. In Example 3.9, try the probit model for the cumulative probability of diagnosis of 
metastasis, using the Albert—-Chib sampling method. 


8. In Example 3.11 consider a logit link and direct Bernoulli sampling (i.e. without 
referring to an underlying continuous variable) to reproduce the linear and GAM 
models. Also, try a model with a random walk prior on the age effect, such that the 
Bernoulli probability is 


Br = Bo Biri Borax Bax 


The prior on B}, will need to take account of the varying gaps between (grouped) 
ages. 
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9. In Example 3.15, the code for Model A includes the influence function of Bianco and 
Yohai (1996), where 


gd =d- Êje (d<o) 
=c/2 (d >) 


di is the deviance contribution of case i, and c = — log (k), and k is small (between 
0.01 and 0.05). In Program 3.15, c is set to 3.6 ~ — log (0.03). Try using the posterior 
means on the weights g(d;)/d;, scaled to average 1, as alternative precisions in the 
Albert—Chib method with Student ¢ sampling via scale mixing. How does varying K 
to 0.01 affect inferences? 
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CHAPTER 4 


Analysis of Multi-Level Data 


4.1 INTRODUCTION 


The behaviour and experience of, and outcomes for, individual actors in society is 
defined by the institutions or social groups to which they belong. Failing to take 
account of such contextual effects on individual level outcomes may lead to what are 
known as atomic or atomistic errors of inference (Courgeau and Baccaini, 1997). 
However, analysis of aggregates of individual outcomes (e.g. studies linking area health 
status to area income) may be subject to a so-called ecological fallacy by neglecting to 
include the mode of operation of risk factors at the individual level. Hence, an analytic 
approach simultaneously taking account of different levels of aggregation but consider- 
ing the outcome at an individual level may be preferred and less subject to either fallacy. 
The statistical task is then ‘to partition the variation in a response variable as a function 
of levels in the hierarchy and relate this variability to descriptions of the data structure’ 
(Browne et al., 2000). 

Examples where this type of approach is relevant include pupil attainment within 
schools, psychopathological behaviour within families (Martinius, 1993), or illness or 
mortality rates among residents classified by county or local authority (Brodsky et al., 
1999). In these cases, pupils, offspring and individual community residents define the 
lower level, known as level 1, of the data hierarchy, and the groups they are members of 
define the higher level or level 2. Further levels of aggregation are possible, with lower 
level clusters or groups (e.g. classes of pupils) arranged within schools at level 3, which 
are further arranged according to local education authority at level 4. Another multi- 
level structure is defined by repeat observations on an individual subject (e.g. in growth 
curve models), in which the observations on the same individual constitute the first or 
most basic level of analysis and the subjects themselves are at level 2. 

The model will need to reflect typical features of multi-level data. Thus, individuals 
within groups tend to be more alike than individuals across groups. Similarities between 
pupils taught in the same school, or between residents of the same neighbourhood, then 
generate a correlation at group level. The schools or neighbourhoods constitute ‘clus- 
ters’ of the level 1 units and the correlation of responses within clusters is denoted intra- 
cluster correlation (Goldstein, 1995; Humphrey and Carr-Hill, 1991). A multi- 
level analysis may then seek to identify the effects of both cluster and individual level 
variables on the individual level outcome. 
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Such an analysis will take account of the positive correlation within clusters which 
otherwise will result in understating the standard deviation of the effects of cluster 
variables. As well as introducing known influences at cluster level, we may introduce 
shared unobserved variables within clusters which to some degree model the intra- 
cluster correlation among the responses. We are obviously also interested in the impact 
of individual attributes on the outcome, and whether they vary across clusters. A multi- 
level model will provide estimates of cluster-specific regression estimates, namely vary- 
ing intercepts and slopes, that use the information from subjects within each cluster but 
also borrow strength from other clusters. 

Consider the case of spatial epidemiology, where both individual and contextual 
factors are relevant in explaining variations in individual health outcomes. The context 
is provided by various levels of geographic aggregation, and effects of area variables 
have been termed ecologic effects by Blakely and Woodward (2000), Morgenstern 
(1998) and others. Health inequalities are associated with many individual risk factors 
(e.g. marital status, ethnicity, social class), but are also known to show wide spatial 
variation beyond that which can be attributed to differences in the social composition of 
the population. An example is the North-South contrast in both mortality and illness 
rates in Britain, which is not wholly explained by differences in social and demographic 
structure (see Example 4.2). 

At the cluster level therefore some contrasts in mean rates may be compositional — 
merely reflecting the aggregate effect of the socio-economic composition of each area’s 
or region’s residents.' However, there may additionally be genuine contextual effects: 
for instance, if the health experience or behaviour of an individual of a given type (e.g. 
as defined by age, class, etc.) varies across regions (Duncan et al., 1999). Morgenstern 
(1998) gives some interesting examples of contextual effects: thus, residence in a mainly 
Protestant area may raise the suicide risk among non-Protestants, so that there is an 
interaction between individual religion and the religious composition of areas of resi- 
dence. If, for instance, x denoted individual religion (Protestant or not), X the propor- 
tion Protestant in areas of residence, then prediction of y (suicide risk) might involve a 
model y = bọ bıx bX  b3xX, where bz reflects a direct contextual effect and b3 an 
indirect one, through the interaction just mentioned. 

One possible approach to such contrasts across contextual settings may be denoted 
the tabulation method: this involves separate regression analyses for each cluster and 
comparison of the resulting effect estimates across clusters. Separate analyses for each 
cluster are, however, not the best way to study the interplay of contextual and compos- 
itional effects, since they neglect within cluster dependence. 

Multi-level analysis often raises further complex statistical issues. For instance, 
cluster effects (such as the effect of school variables on pupil attainment) may involve 
small sample sizes and possible clusters which are outliers. Hence, robust methods may 
be needed to assess inferences and here Bayes methods (e.g. for heavy tailed alternatives 
to say normally distributed cluster effects) come into play. There may be issues of 
heteroscedasticity where for example the level one variance is not constant but depends 
on explanatory variables — an example might be a residual variance at level 1 differing 
between boy and girl pupils. Multilevel outcomes may well also be multivariate (e.g. two 
types of exam score for pupils), and the repeated observations of scores on the same 


' Note that this term does not refer to a decomposition of variance in the sense of spatial vs unstructured 
errors in spatial outcomes as will be considered in Chapter 7. However, spatial correlation of cluster effects 
(e.g. among varying intercepts and slopes) may be relevant when the clusters are geographic areas. 
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subject constitute the first level in this instance. Further, while many multi-level appli- 
cations consider only inference within the sample or survey, one may also use multi-level 
approaches in making inferences beyond the survey, specifically for small domain 
estimation. The goal may be to make inferences about the average outcome in a 
domain (formed, say, by cross-classifying demographic attributes with relatively low 
scale geographic level) where the survey data themselves contain only a few units, 
perhaps none, from that domain. This may involve referring to other sources of data 
on the domains (e.g. Census data as well as the survey results themselves). 

A Bayesian estimation perspective, especially one based on MCMC methods, may 
have advantages over maximum likelihood estimation based on (say) iterative general- 
ised least squares. Thus, GLS estimates may understate random effect variances because 
not all sources of sampling uncertainty in the relevant parameters are allowed for in 
their derivation. This is particularly so for small numbers of clusters (e.g. the number of 
classes or schools), small cluster sizes, and in unbalanced hierarchical data sets, where 
cluster sizes differ between clusters. GLS methods may also suffer from convergence 
problems in these situations. A Bayesian model based on iterative sampling may be 
useful in providing simplified ‘significance tests’ on parameters or derived quantities, 
based on counting samples where the hypothesis holds. By contrast, a maximum 
likelihood approach typically rests on asymptotic normality and relevant standard 
errors for derived quantities (e.g. for a difference in level 1 variances if these are made 
functions of regressors) may be difficult to derive. However, although the Bayesian 
approach may be advantageous in some respects, there may be the sensitivity to prior 
specifications, especially for small samples, or small numbers of clusters at higher levels. 
This may be the case, for instance, regarding covariation of the random effects at 
different levels (Daniels, 1999; Browne and Draper, 2000; Daniels and Kass, 1999). 

This chapter adopts a Bayes perspective in terms of the application of multi-level 
concepts to both continuous and discrete data (Section 4.2) Examples of different 
outcome types are provided by two level models applied to continuous pupil attainment 
and binary health outcome data. Then Section 4.3 considers multi-level models includ- 
ing heteroscedasticity at one or more levels, with variances dependent on continuous or 
categoric regressors. This is illustrated with the pupil attainment data set and with a 
binary attitude outcome from a survey. Questions of robust inference are raised in 
multi-level models, where multivariate normality of cluster and higher level effects is a 
typical default, but may be problematic if cluster sample sizes are small (Seltzer, 1993). 
This question is considered in Section 4.4. The chapter concludes in Sections 4.5 and 4.6 
by considering multivariate outcomes within a multi-level context, and the application 
of multi-level concepts to derive population wide predictions within domains defined by 
survey variables. 


4.2 MULTI-LEVEL MODELS: UNIVARIATE CONTINUOUS 
AND DISCRETE OUTCOMES 


The nesting of observations allows considerable scope for differentiating or indexing 
regression and or error variance effects, guided both by subject matter indications and 
by statistical criteria such as model parsimony and identifiability. For example, suppose 
we have pupils arranged by class and school (levels 2 and 3, respectively); then the effect 
of the level 2 variable, classroom size, on a continuous attainment score (the level 
1 outcome) may be affected by school resources (at level 3). The effect on attainment 
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of a level 1 variable, such as pupil ability, may be differentiated both by class and 
school. 

In terms of formal model specification, consider simply the nesting of pupils within 
classes. Suppose we have a two level data set with continuous observations Yj in classes 
j=1,..J at level 2, and pupils i= 1,..n; within classes at the lowest level. Predictors 
may be defined at each level, say Xn;(h = 1,..p) at level 1 and Z),;(h = 1,..q) at level 2; 
an example of the first type of predictor might be pupil ability or gender, and of the 
second, class size or teacher style. In a multi-level regression model, covariate effects and 
errors may be specified at each level and are potential cumulative sources of explanation 
or error in the outcome. 

Thus, in predicting the outcome at level 1, one may define error terms v; at level 2, and 
uj at level 1 (both with zero mean). Suppose p = 2 with X\ = ability and X2 = gender 
(1 = girls, 0 = boys). Then a two-level model for continuous outcomes with fixed 
impacts of the level 1 predictors might take the form 


Yj=a Bij PBX Vj Uy 
Combining the intercept and level 2 error gives 
Yj= o; BiXiy B2Xy uy (4.1) 


so that the a; are varying regression intercepts over the level 2 units. The mean response 
at level 1 would be conditional on the random intercept: 


E(Y¥i|Xy, u) = BiXiy BX; o 


So the centred Normal version of this two level continuous data model, where 
uj; ~ N(0, o°) would be 


Y; ~ N(B: Xy B2X2y aj, o°) 


Note that this is the form for Gaussian or Student ¢ data used in BUGS (remembering 
of course that BUGS uses a different parameterisation of the Normal). 

In this model the impact of level 1 predictors is constant across higher level contexts 
and the effect of the cluster is reflected only in different intercepts. Typically, the cluster 
effects (random intercepts) a; would be exchangeable errors with no correlation struc- 
ture, that is unstructured ‘white noise’; however, if the clusters were geographic areas 
one might envisage them being spatially correlated. One may well seek to explain 
varying intercepts in terms of the characteristics of clusters (or characteristics of higher 
level groupings of the clusters themselves). 

The interpretation of the regression coefficient 8, for the Ath predictor Xj, in (4.1) is 
then as a change in the expected response for a unit change in that predictor, with the 
error term at level 2 held constant. By contrast, multi-level models commonly differen- 
tiate the effects of the level 1 predictors according to clusters (here school classes) 
j= 1,.. J. So the effect of pupil ability or gender may be differentiated as follows 


Yj =(P SXi (B2 yX; aj Uy 
where 8); and 82; express the differential effect (with mean zero) that a pupil’s class has 
on the impacts on attainment of pupil ability or gender. One might also, conflating the 
fixed and zero-centred random effects 5);, write this model as 


Yi = ByXiy PyXzy Qj My (4.2) 
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As noted by Clayton (1996), the Bayesian viewpoint means that there is no longer a need 
to partition a parameter effect into fixed and random components. So we have a model 
with varying intercepts and slopes. To enable pooling of strength typically involves 
assuming these effects are random, namely drawn from a parametric population wide 
distribution over school classes — though fixed effects approaches may arguably have 
more validity in certain circumstances (Rice et al., 2000). 

A common assumption is for univariate or multivariate normal errors for the higher 
level effects, namely a; in (4.1), and {a,;, Baj} in (4.2). Thus, the priors for varying 
intercepts and slope effects for schools might be taken as independent Normals 


Qaj ~ N(A, Ta) 


Baj ~x N(B», TB.h) h = 1, -p 


where A and the B, are known constants. Alternatively, a p 1 multivariate normal 
density could be used allowing for covariation between intercepts and slopes. As 
mentioned above, more informative from a subject matter viewpoint may be to relate 
variations in the parameters {a;, B,;} to cluster predictors Z;. As considered further 
below, there may also be gains from modelling the variances as functions of predictors. 
In applications where the clusters are geographic areas, the multivariate density for 
{a;, Bay might allow for spatial correlation (see, for example, Leyland et al., (2000), and 
the discussion in Chapter 7). 

While the above notation implies a nested arrangement of the data, it is often 
convenient, especially with unequal n; in each cluster, to arrange the data in terms of 
a single subject index, so that a univariate outcome y would be arranged in terms of 
a vector of length T, = nj. A vector of cluster membership indices, with values 
between 1 and J would also be of length N. This type of arrangement is also useful 
for crossed rather than nested data structures, for example, pupil data by school and 
area of residence. 


4.2.1 Discrete outcomes 


Multi-level analysis of discrete outcomes may be carried out in the appropriate linked 
regression (e.g. via a log link for a Poisson dependent variable or logit link for a 
binomial variable) (Goldstein, 1991). This may involve introducing into the regression 
structure such forms of random variation that are needed both to describe the hierarch- 
ical data structure and account for heterogeneity beyond that expected under the model 
form. For example, consider a two level Poisson count Yy for i= 1,..n,; units within 
j =1,..J groups, and with T, = n;. Denoting the Poisson means as j.,; and assuming a 
single level 1 predictor X;;, a log-linear regression may be specified 


log (Mi) =O; BX 
or with a level 1 error term 
log (yj) =a; BAX Uy (4.3) 


The first form assumes heterogeneity will be largely accounted for by the group specific 
intercepts and slopes, while the second allows an unstructured error with variance g? to 
account for residual heterogeneity beyond that associated with the Poisson regression 
(Congdon, 1997, Ghosh et al., 1998). 
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This sort of over-dispersion is apparent in deviances at the posterior mean exceeding 
the available degrees of freedom (McCullagh and Nelder, 1989). For a binary or 
binomial outcome, there is a similar choice. Thus, suppose Yj is binomial, with 
Y; ~ B(my, Ri), where Ry is the total number of events, and Yj; is the number of 
‘successes’. Then a logit link model with cluster only intercepts 


logit(m;) = BX; Qj 


may need (if there is still over-dispersion) to be expanded to include level 1 variability, 
so that 


logit(m;) = BX; Qj Uij (4.4) 


It may be noted though that a suitable prior for the level 1 variance in a Poisson or 
binomial model, as in Equations (4.3) or (4.4) may require care to avoid large values of 
o? since otherwise the random effects tend to produce too close fit to the data (Johnson 
and Albert, 1999, p. 113). Thus Johnson and Albert suggest for their application an 
informative inverse gamma prior for the variance, with parameters 5 and 1.5, which 
result in most of the mass for the variance being concentrated between values 0.1 and 1.5. 

One advantage of simulation based Bayesian estimation and inference is the ease of 
direct modelling of Poisson or binomial outcomes without resorting to weighted least 
squares approximations. For example, over-dispersion may be modelled via Normal or 
possibly Student ¢ random errors in the log-link, as in Equation (4.3). One might also 
adopt conjugate forms to model the level 1 variation, with a Poisson outcome for 
instance being modelled as 


Yj ~ Poil myy) 
where the y; are taken as gamma variables with mean 1. Alternatively, for suitably large 
counts Normal approximations to the binomial or Poisson may be used. This implies 
Normal sampling (or possibly Student ¢ sampling) but with a variance function appro- 
priate to the form of the data. Thus for a binomial outcome, with Yj events occurring in 
Ry at risk 


Yy ~ N(wy Ry, Vij) 
with 


and with the regression for the m; involving a logit or probit link. In Equation (4.5) 
b? = 1 would be expected (approximately) if the level 1 variation were binomial, 
whereas heterogeneity beyond that expected under the binomial yields b? well above 1. 


Example 4.1 Language scores in dutch schools As an example of the two level 
situation for continuous data, consider data on language scores in 131 Dutch elemen- 
tary schools for T, = 2287 pupils in grades 7 and grade 8, and aged 10 and 11 (Snijders 
and Bosker, 1999). In each school a single class is observed, and so the nesting structure 
is of pupils within J = 131 classes. We are interested in the impact on language scores of 
pupil level factors such as IQ, and pupils social status (SES). Also relevant to explaining 
possible differences in intercepts and slopes (on IQ and SES) are class level variables: 
these include the class size, the average IQ of all pupils in the class, and whether the class 
is mixed over grades: thus COMB = 1 if the class includes both grade 7 and grade 8 
pupils and COMB = 0 if the class contains only grade 8 pupils. These variables are 
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denoted ClassSize[], IQ.class[]) and comb[] in Program 4.1. Following Snijders and 
Bosker, the pupil variables and the class variables are centred, and this aids in achieving 
earlier MCMC sampling convergence. Note that in Program 4.1 the data are arranged 
by subject in a ‘single string’ vector, with class memberships 1, 2...131 in a vector of 
length 2287. 


A model with random intercepts but fixed impacts of the two pupil level vari- 
ables (Model A) is estimated first. The intercepts are explained by three class 
level variables: average IQ, mixed grade class or not and class size. So Model A is 
defined as 


Yj ~ Nm O') f=1,..437=1,..5 
wy =O BX- X) Ba- X) 
a; ~ Nw; $z) 


y= Y(Zy- Z) y3(Zy—Z2) Y4(Zy- Zs) 


where X, and X are pupil IQ and SES and Z1, Z2 and Z3 are, respectively, class IQ, 
class grade type, and class size. The assumed priors are ẹ~ G(0.001, 0.001), 
bpa ~ G(0.001, 0.001), B ~ N(O, 107), k = 1, 2 and y, ~ N(O, 10’), k = 1, 4. 

Posterior estimates are based on the second half of a three chain run of 5000 
iterations. Initial values for B and y in the three chains are provided by null start values 
(chain 1), start values in chain 2 equal to the posterior means from the trial run, and 
values in the third chain equal to the 97.5% point from the trual run. Starting values for 
the inverse variances are also based on the test run. Convergence by the Brooks—Gel- 
man criteria (Brooks and Gelman, 1998) is evident at well under 1000 iterations, with 
mixing satisfactory in terms of rapid decline in the autocorrelations between parameter 
iterates at successive lags. 

The parameter estimates for model A (Table 4.1) show that pupil IQ and SES both 
have clearly positive impacts on attainment. The intercepts vary between classes in such 
a way that classes with higher average IQs and containing only grade 8 pupils have 
higher attainments. There is also a weak negative impact y4 of class size, with 95% 
interval (from —0.14 to 0.06) biased towards negative values. One option here would be 
to try a transform of the class size variable, since a negative value might be expected on 
subject matter grounds. 

A second model adopts random slopes on pupil IQ and SES; the same class 
level predictors as in Model A are used to explain variations in the slopes on these 
two pupil predictors. Thus in Model B 


Yj ~ Nap PD i=l... fal. J 

by = By ByXX) By — X2) 

B; ~ N30,;, V-B) (4.6a) 
Veg = yk YalZy— Z) YlZy- Z) Yæ(Zy-— Z3) (4.6b) 


The varying class level parameters B; = (Bij, Bz» B3) are taken to be trivariate Normal 
with means v; = (vij, vy, v3j). The initial values for the three chains are provided by the 
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Table 4.1 Dutch language tests 
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Model A 

Mean St. devn. 2.5% 50% 97.5% 
Deviance Precisions 14930 16.98 14900 14930 14960 
ob 0.025 0.001 0.024 0.025 0.027 
þ.a 0.131 0.022 0.093 0.129 0.178 
Level 1 Effects 
Pupil IQ 2.21 0.07 2.06 2.21 2.35 
Pupil SES 0.16 0.01 0.13 0.16 0.19 
Level 2 Model for Level 1 Intercepts 
Intercept 41.58 0.44 40.72 41.58 42.46 
Class IQ 1.10 0.32 0.48 1.10 1.72 
Combined Class —2.10 0.82 —3.75 —2.10 —0.50 
Class Size —0.038 0.053 —0.142 —0.037 0.065 
Model B 

Mean St. devn. 2.5% 50% 97.5% 
Deviance 14860 24.26 14810 14860 14910 
Level 2 Covariance Elements 
V.Bi1 7.658 1.343 5.373 7.533 10.57 
V.Bio —0.537 0.295 —1.145 —0.528 0.023 
V.Bi3 —0.001 0.067 —0.132 —0.002 0.134 
V.B 0.251 0.087 0.121 0.238 0.449 
V.B —0.013 0.014 —0.044 —0.012 0.012 
V.B33 0.030 0.005 0.021 0.029 0.042 
Level 1 Precision 
ob 0.026 0.001 0.024 0.026 0.027 
Level 2 Parameters (Equation (4.6)) 
Vu 41.58 0.49 40.63 41.57 42.54 
yi2 2.13 0.15 1.84 2.14 2.43 
yi3 0.166 0.036 0.096 0.166 0.238 
y21 0.896 0.361 0.171 0.902 1.586 
Ya —0.013 0.091 —0.189 —0.013 0.166 
Y23 —0.012 0.026 —0.064 —0.012 0.040 
Y31 —1.947 0.886 —3.696 —1.935 —0.228 
Y30 0.380 0.282 —0.139 0.374 0.915 
Y33 0.016 0.064 —0.110 0.016 0.142 
Ya —0.044 0.055 —0.154 —0.044 0.062 
Yy —0.006 0.017 —0.041 —0.006 0.025 
Y43 —0.002 0.004 —0.010 —0.002 0.006 
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same procedure as in model A except that the starting values for the precision matrix on 
the parameters {B;, Bj2, Bj3} are obtained by random draws from the Wishart prior for 
this matrix. This prior is taken to have three degrees of freedom and an identity scale 
matrix. 

A 5000 iteration run with three chains is used for estimation, with the summary based 
the second half of the run. Note that Bayes estimation suggests that there does appear 
to be variation in the impact of SES across the classes, and some suggestion of 
covariation between SES and IQ: by contrast, Snijders and Bosker, using a weighted 
least squares method, failed to gain convergence when this variation and covariation 
was allowed for. 

However, the extended model calls into question the relevance of the full set of cluster 
variables Z; in predicting variability in slopes, and suggests over-parameterisation; the 
exception is the parameter y3, which represents the additional impact of child IQ on 
attainment in mixed classes where COMB = 1. Note that these findings point to 
reformulation of the regression model (4.6b) to explain variable slopes, and not neces- 
sarily to a drawback in the random intercepts and slopes assumption (4.6a). As Snijders 
and Bosker (1999, p. 77) point out, it is not necessary to use all the cluster variables Zp; 
in explaining variability in the coefficients B,;,k =1,...p 1. 


Example 4.2 Long term illness As an example of binary outcome at individual level, 
this example follow Shouls, Congdon and Curtis (1996) in considering variation in 
individual level chances of the binary outcome, namely being long term ill. The data 
used draw on a 2% sample of anonymised individual records (abbreviated as the SAR) 
from the 1991 UK Census, and nested within 278 local authority areas. The full analysis 
focused on the age group 15-64 and on a 10% sample of the SAR data itself, i.e. a 0.2% 
sample of the full Census, amounting to around 90 thousand males and females in this 
age group. We focus here on females aged 45-59, excluding cases with missing covari- 
ates; the covariates are age, non-white ethnicity, being married and being in lower skill 
manual occupations (social classes IV and V). Additionally, an indicator of multiple 
deprivation is included for each individual, and is a tally of yes/no responses according 
to whether the individual is unemployed or the household of the individual does not 
own their home, does not own their car, has no access to a separate bathroom, or lives 
at over | person per room. This tally S appears as the variable sumdep[ ] in Program 4.2, 
which is then transformed to the regressor X = log(1 S). A quadratic term in age is 
used (see Program 4.2); other options might be linear or log-linear terms, though a 
linear effect does not reflect the fact that the chance of long term illness increases steeply 
towards the end of the 45-59 age band. 


The study of Shouls et al. (1996) found that there were contextual impacts of 
individual deprivation: its effect was greater in (more affluent) areas with lower average 
illness levels. This conclusion was based on a bivariate normal model at local authority 
area level for the intercepts and slopes on individual deprivation. Note that the analysis 
of Shouls et al. was based on iterative weighted least squares. Here three models are 
considered: 


© one with intercept only variation (model A); 

e one which replicates the Shouls et al. analysis, but with a Bayes specification of the 
contextual effect just described (Model B); and 

e Model C, in which area variability in the effect of social class is added. 
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Further to the work of Shouls et al., a cluster level variable Z; is introduced into Models 
B and C: thus Zı; = 1 if local authority j is in the North of England. This variable may 
assist in explaining intercept and slope variation. 

It may be noted that varying effects across clusters implies two types of ecologic effect 
on health status: thus define Y; = 1 if individual 7 in local authority j is long term ill 
(and Yj = 0 otherwise). Then Model B states the following: 


Y; ~ Bern(ry) 


logit(™) = Bo B;Dep; B2AgeSq,; 8;LowSkill; 
B4NonWhite; ®;Married; (4.7a) 


where Dep is the transformed (and then centred) deprivation score, and the age squared 
variable is also centred. Model A has varying intercepts as in Equation (4.7), but a fixed 
effect of individual deprivation, Bı; = Bı. 

The level 2 models for the variable intercepts and slopes in Model B are 


Boj = Yoo YoZy £y (4.7b) 
By = Yo YuZy €y (4.7c) 


where the g; = (£o, €1;) are bivariate normal. (In model A, Bo, is univariate Normal.) 
Substituting (4.7b) and (4.7c) into (4.7a) then gives 


logit(m) = Yoo YZ YioDepy YiuZ Dep; B2AgeSq; 
B3;LowSkill; B4NonWhite; ®;sMarriedy ej Dep; 


Thus there is both a direct ‘cross-level effect’ of Zı on the individual outcome (with 
coefficient yo,) and an indirect cross-level effect with coefficient y,; produced by the 
interaction of individual deprivation and the cluster variable (here the North-South 
divide in England). 

In both Models A and B, three chains” are run for 2500 iterations, with convergence 
in the fixed effects and elements of the cluster precision matrix (cluster variance in 
Model A) apparent at under 500 iterations and with satisfactory mixing (fast decay 
in lagged dependence in sampled parameter values); the posterior summaries are based 
on iterations 500-2500. 

Model A shows the anticipated positive impacts of age on the chance of illness and 
a lower rate for married people. It is notable that social class effects on illness exist 
even after allowing for individual deprivation. Model A shows significant variation 
between local authorities in long term illness rates after accounting for these 
observed characteristics. One advantage of sampling based estimation is the maximum 
and minimum intercept may be monitored; these average —1.33 and —3.02. 

This suggests we consider either (a) simply allowing for variable effects on illness 
over area of the individual characteristics (age, etc.), and modelling covariation of the 
variable slopes with the illness rate, or (b) additionally modelling the variability in local 
authority intercepts and slopes in terms of area attributes, such as area deprivation, 


? Initial values are provided by a null (zero values) on the regression coefficients, by the posterior mean of a 
test run, and by the upper 97.5% point of that test run. 
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position in terms of the North-South divide in the UK, and so on. Only option (b) 
allows for the cross-level effects just described. 

Here Model B adopts a step along such a broader perspective in considering a 
bivariate model for varying intercepts and varying effects of individual deprivation, 
and in introducing a simple cluster variable. The results for Model B show a higher level 
of illness in the North than the south of England (a clearly positive yo,), but no effect of 
this variable on the slope variation. There is only a weak negative correlation, averaging 
—0.13, between the illness level and the slope on individual deprivation. Model C shows 
more evidence suggesting a contextual effect, with the varying slopes on social class IV 
and V tending to be higher in low illness areas (Table 4.2). 


4.3 MODELLING HETEROSCEDASTICITY 


Regression models for continuous outcomes, whether single or multi-level, most fre- 
quently assume that the error variance is constant. In a multi-level analysis, for instance, 
this means that the level 1 variance is independent of explanatory variables at this level. 
It is quite possible, however, that the variance (and so the precision also) are related 
systematically to explanatory variables or other characteristics of the subjects. In 
discrete data models (e.g. Poisson or binomial) random effects at level 1 may be 
introduced if there is over-dispersion (see Equations (4.3) and (4.4)), and such errors 
may have a variance which depends on the explanatory variates. Heteroscedasticity may 
also be modelled at higher levels, as the abortion attitudes example below illustrates. 

Recent work comparing marginal and conditional regression specifications in multi- 
level modelling emphasises the need to model heteroscedasticity where it is present 
(Heagerty and Zeger, 2000). Goldstein et al. (1991), and more recently Browne et al. 
(2000), have argued that proper specification of the random part of a multi-level model 
(i.e. allowing for possible non-homogenous variances at one or more levels) may be 
important in inferences on the mean regression coefficients. Therefore, one way towards 
more robust inference in multi-level, and potentially better fit also, is to model the 
dependence of variation on relevant factors; these might well be, but are not necessarily, 
among the main set of regressors. 

If the differences in variance are specified according to a categorical variable Cj 
observed at level 1, then one might simply take variances specific to the levels 1,..K of 
Cj. For instance, if 6, denotes the inverse variance for the kth level of Cj, then one 
might adopt a series of gamma priors 


ob, Sod G(a, bı), cy) aed G(a, by), tee b. c Gag, bx) (4.8) 


Equivalently, log (,;) can be regressed on a factor defined by the levels 1,.., K of Cj. 

One might also model the heterosecdasticity by relating log variances to a general 
function of relevant factors or the entire regression term. An explicit random error 
derivation can be illustrated by a two level model, with 


Yj =Po Bitz Bzy . uy 
Then the level 1 random effect is written as 
uj = Roy Riy Wi 
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Table 4.2 Risk factors for long term illness 


Mean St. devn. 2.5% 97.5% 
Model A 
Log-Likelihood —2710 12 —2737 —2689 
Covariates modelled as random 
Average Intercept —2.18 0.09 —2.38 —2.03 
Intercept variance 0.14 0.04 0.08 0.22 
Fixed Effects of Individual (Level 1) Covariates 
Age Squared 0.00059 0.00008 0.00043 0.00075 
Social Classes IV,V 0.34 0.08 0.17 0.49 
Non-White 0.19 0.18 —0.15 0.56 
Married —0.17 0.09 —0.34 0.02 
Deprivation 0.68 0.09 0.50 0.86 
Model B 
Log-Likelihood —2687 11 —2707 —2665 
Covariates modelled as random 
Average Intercept —2.34 0.09 —2.52 —2.16 
Average Effect of Individual Deprivation 0.64 0.13 0.40 0.89 
Effect of N-S divide on Intercept 0.42 0.09 0.23 0.60 
Interaction between N-S divide and Individual Deprivation —0.01 0.20 —0.42 0.37 
Intercept variance 0.14 0.04 0.08 0.22 
Slope Variance 0.22 0.09 0.09 0.45 
Correlation between Intercepts and Deprivation Slopes —0.13 0.22 —0.55 0.31 
Fixed Effects of Individual (Level 1) Covariates 
Age Squared 0.00060 0.00008 0.00044 0.00075 
Social Classes IV,V 0.34 0.08 0.17 0.50 
Non-White 0.30 0.19 —0.07 0.65 
Married —0.18 0.09 —0.36 —0.01 
Model C 
Log-Likelihood —2673 13 —2697 —2648 
Covariates modelled as random 
Average Intercept —2.32 0.10 —2.52 —2.14 
Effect of N-S divide on Intercept 0.37 0.11 0.16 0.58 
Average Effect of Individual Deprivation 0.66 0.13 0.42 0.90 
Interaction between N-S divide and Individual Deprivation —0.06 0.19 —0.43 0.33 
Average Effect of Individual Social Class 0.23 0.12 —0.01 0.46 
Interaction between N-S divide and Individual Social Class 0.19 0.19 —0.19 0.57 
Intercept variance 0.13 0.04 0.07 0.23 
Deprivation Slope Variance 0.25 0.12 0.09 0.53 
Social Class Slope Variance 0.25 0.11 0.09 0.52 
Correlation between Intercepts and Deprivation Slopes 0.00 0.24 —0.47 0.48 
Correlation between Intercepts and Social Class Slopes —0.32 0.22 —0.68 0.15 
Correlation between Class Slopes and Deprivation Slopes —0.20 0.27 —0.67 0.33 
Fixed Effects of Individual (Level 1) Covariates 
Age Squared 0.00060 0.00008 0.00044 0.00076 
Non-White 0.31 0.19 —0.07 0.66 
Married —0.19 0.09 —0.37 —0.01 
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where W; = Xj8 is the total linear regression term’, and where 
var( Roy) = a; var( Rij) = o? 
and cov(Roy, Ri) = o1. Hence, 


c 2 2 1772 
Vi — var(u;j) = 09 2001 Wi O07 Wi 


As Snijders and Bosker (1999, p. 114) note, such a formula can be used without the 
interpretation that of and of are variances and oo) a covariance; the formula may 
simply be used to imply that the level 1 variance is a quadratic function of Wy. Note that 
in BUGS, one must transform back from precisions to variances to find how variances 
differ between groups or subjects defined by different predictors. 


Example 4.3 Language score variability by gender This example continues Example 4.1 
in terms of language scores of Dutch school children. However, instead of assuming a 
constant level 1 variance, we consider possible heteroscedasticity according to pupil 
characteristics. As a simple illustration of heteroscedasticity, consider the pupils in 
terms of grouped IQs, and the resulting averages and variances of the scores (Table 4.3). 


It is apparent that at IQs above 12, there is a lesser variability in test scores (as well as 
higher average attainment). 

Snijders and Bosker (1999, p. 111) consider the same phenomenon, but by pupil 
gender. We follow their analysis and include a ‘Bayesian significance test’ to assess 
whether in fact the gender specific variances do differ in terms of conventional signifi- 
cance levels. Let G; denote the gender of pupil į in class j (= 1 for girls, 0 for boys). 
Variable regression slopes for child IQ are assumed, but a homogenous regression effect 
of SES and gender. A single cluster attribute (Z, = class IQ) is used to explain variation 
in intercepts and the child IQ slopes. The model, coded in Program 4.3, may then be set 
out as follows: 


Yy~N(ny $z) f= Leg fa l.d 

by = By ByAQ;-IQ) B,(SES;—SES) B4Gy 
P; ~ No(;, Vp) 

Vig =VWe Ya(Zy—Z1) k=1,2 


log b; = C1 Gij 


3 Alternatively, a single regressor might be used, with 
uj = Roy RiyXy 


Hence, 


2 2 y2 
Vij = var(uj) = 00 200X; oX; 
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Table 4.3 Means and variances of scores by IQ group 


IQ group Average language St. devn. of 


score language score 
4-5.99 28.3 8.1 
6-7.99 28.8 8.5 
8-9.99 32.3 7.7 
10-11.99 37.7 8.1 
12-13.99 43.9 6.8 
14-15.99 48.5 5.5 
16 50.2 4.7 


Priors with large variances are assumed for the B and y coefficients, but for numerical 
stability cı and c2 are assigned relatively informative N(0, 1) priors. 

Analysis is based on three parallel chains and a 5000 iteration run with 500 burn-in. 
Mixing and convergence are satisfactory. In Table 4.4, R.B[1, 2] denotes the correlation 
between intercepts and IQ slopes, and shows a clear contextual effect: classes with lower 
than average attainment have higher impacts of individual IQ. The coefficient c2 shows 
that girls have higher precision (and hence lower variance) in their language scores, and 
the posterior mean variances for boys and girls are, respectively, 38.6 and 36.2. 

It may be noted that the coefficient c2 straddles zero throwing doubt on a clear 
difference in variances. To assess whether the variance for boys exceeds that of girls, a 
significance test based on the proportion of iterations where the condition holds, shows 
a significance rate around 85%. Furthermore, a reduced model with equal gender 


Table 4.4 Heteroscedasticity in Level 1 language score 


variances 

Mean St. devn. 2.5% 97.5% 
Deviance 14780 20.64 14740 14820 
R.BIL, 2] —0.55 0.15 —0.79 —0.22 


Gender specific level 1 variances 


o? (boys) 38.60 1.68 35.45 41.97 
o? (girls) 36.23 1.66 33.10 39.60 
c1 —3.65 0.04 —3.74 —3.57 
C2 0.064 0.064 —0.061 0.191 


Fixed slopes 


B3 0.15 0.01 0.12 0.18 
B4 2.47 0.25 1.97 2.96 


Cluster model for variable intercept and IQ slope 


Yi 39.57 0.32 38.94 40.19 
Yio 2.28 0.08 2.12 2.44 
Y1 1.07 0.32 0.42 170 


Yn —0.11 0.08 —0.27 0.05 
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variances provides no increase in average deviance, and so one may conclude that there 
is in fact no pronounced evidence of different gender variances. 


Example 4.4 Attitudes to abortion Heagerty and Zeger (2000) consider data from 
four waves of the British Social Attitudes Survey (1983-86) on attitudes to abortion. 
Level 1 is then the survey wave, level 2 is the survey respondent and level 3 is district of 
residence. Of particular interest in their analysis are the modelling of heterogeneity at 
levels 2 and 3 (subjects and clusters), and of heteroscedasticity at level 2, and resulting 
impacts on fixed regression coefficients. The T, = 1056 responses are classified by 
district j (J = 54), subject i (with 264 subjects), and by year ¢ within subject (t = 1, 4). 


Heagerty and Zeger consider a dichotomisation of abortion attitudes, namely the 
views encompassed under ‘no legal restriction’ as against ‘possible legal restriction’. 
Thus, Yj; = 1 if a person believes abortion should be permitted in a range of hypothet- 
ical situations, while Yj, = 0 if a person believes abortion should not be allowed in one 
or more of these circumstances. At subject level the covariates are year, social class (1 = 
middle, 2 = upper working, 3 = lower working), sex (l= male, 2 = female), and religion 
(Roman Catholic = 1; Protestant or Church of England = 2; other religion = 3; no 
religion = 4). Because these covariates are fixed over time, they are denoted Xj. They 
are modelled as categorical factors, with the first level in each having a null effect. Note 
that, for a factor with F levels, if B; = 0 and only B5,..B,- are free coefficients, the 
profile of centred effects (averaging zero rather than with a corner constraint) can be 
obtained by monitoring the transformed coefficients 


kk = Be — B 
At cluster level (i.e. district of residence), there are an intercept and an overall percent- 
age Protestant (Z;): this variable measures the religious context or social environment 
governing attitudes as opposed to the individual’s creed. 


Thus a three level model (times within respondents within districts) with a logit link 
for the binary outcome, and with no random effects, may be specified as 


Yi ~ Bern(tyr) 


logit(t,) =e nN, BX; YZ 


This independence model (Model A) therefore includes effects for year, social class, 
religion, gender and the district variable. ọ is the overall intercept and the n, are year 
effects with year 1 effect being zero. Early convergence in a three chain run is apparent 
in terms of Gelman—Rubin summaries for {, n, B, y}, with fast decay in the auto- 
correlations at successive lags in the MCMC iterations. The posterior summary in Table 
4.5 is based on 5000 iterations with 500 burn in. 

The parameter estimates for Model A show that working class respondents are less 
likely to give the unrestricted view; and that women as compared to men, and Catholics 
and ‘other religions’ as compared to Protestants and non-believers are also less likely to 
give the unrestricted view. Living in a ‘Protestant’ area also boosts the chance of giving 
a ‘no legal restriction’ response. 

A second model (Model B) introduces random variation at subject and district levels 
(levels 2 and 3). So 


Yin ~ Bern(7 jj) 
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Table 4.5 British social attitudes 


Mean St. devn. 2.50% 97.50% 


Model A (independence model) 


Log Likelihood —628 2 —633 —624 


Intercept —0.79 0.28 —1.34 —0.24 
Year 2 —0.42 0.20 —0.80 —0.04 
Year 3 0.04 0.19 —0.33 0.41 
Year 4 0.18 0.19 —0.19 0.56 
Subject Effects 

Upper Working Class —0.31 0.19 —0.68 0.06 
Lower Working Class —0.42 0.16 —0.74  —0.09 
Gender —0.27 0.14 —0.55 0.00 
Protestant —0.43 0.32 —1.08 0.18 
Other Religion —0.60 0.24 —1.09 —0.13 
No religion 0.70 0.18 0.36 1.05 
Cluster (District) Effects 

Protestant 0.80 0.29 0.25 1.37 


Model B (Level 2 & 3 homogenous errors) 


Log Likelihood —388 12 —412 —365 
Intercept —1.28 0.55 —2.39 —0.20 
Year 2 —0.73 0.26 —1.24 —-0.23 
Year 3 0.07 0.24 —0.41 0.54 
Year 4 0.30 0.24 —0.16 0.78 
Subject Effects 

Upper Working Class —0.51 0.35 —1.19 0.20 
Lower Working Class —0.38 0.34 —1.05 0.29 
Gender —0.53 0.35 —1.26 0.15 
Protestant —0.56 0.63 —1.82 0.66 
Other Religion —0.94 0.50 —1.93 0.02 
No religion 1.07 0.40 0.30 1.86 
Cluster (District) Effects 

Protestant 0.95 0.61 —0.27 2.11 
Random Effects St devns 

Level 2 2.28 0.25 1.82 2.81 
Level 3 0.53 0.40 0.01 1.33 


Model C (Level 2 heteroscedasticity) 


Log Likelihood —387 12 —411 —365 
Intercept —1.25 0.59 —2.38 —0.06 
Year 2 —0.74 0.25 —1.23 —0.24 
Year 3 0.07 0.24 —0.39 0.54 
Year 4 0.31 0.24 —0.15 0.78 
Subject Effects 

Upper Working Class —0.48 0.39 —1.25 0.30 
Lower Working Class —0.22 0.37 —0.95 0.51 
Gender —0.65 0.35 —1.33 0.02 
Protestant —0.55 0.59 —1.72 0.61 


Other Religion —1.14 0.51 —2.16 —0.13 
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Model C (Level 2 heteroscedasticity) (continued) 


No religion 0.77 0.41 —0.02 1.58 
Cluster (District) Effects 

Protestant 1.07 0.61 —0.15 2.22 
Level 2 St devns 

Middle 2.95 0.61 1.91 4.29 
Skilled Working 2.94 0.57 1.98 4.22 
Unskilled 1.77 0.29 1.27 2.40 


Level 3 (District) Standard devn 0.40 0.35 0.01 1.22 


logit(mj) = n, BX; YZ; uj a 
where uj and a; are normal random effects with homogenous variances 1, and Ta- 
Following Heagerty and Zeger, none of the slopes B are taken to vary randomly. Initially, 
G(0.0001, 0.0001) priors are assumed on 1/7, and 1/Ta. N(0, 10) priors are taken on the 
intercept and the effects B, y and n relating to year, percents Protestant, class, etc. Three 
parallel chains are run for 5000 iterations; convergence is apparent at under 500 iter- 
ations, and the posterior summary in Table 4.5 is based on iterations 500-5000. 

Under the prior assumptions just mentioned, there is a pronounced gain in likelihood 
over Model A, though a full assessment would require penalising for the additional 
random effects and variance parameters (see the Exercises). The absolute effects B, 
in terms of posterior means, of most of the categorical variables (class, sex, religion) are 
enhanced, though generally with lower precision (i.e. posterior standard deviations 
are higher for the B coefficients in Model B than Model A). Little difference was 
made running a more informative G(5, 1.5) prior for 1/7, following the example of 
Johnson and Albert (1999). The posterior mean estimate for t, was reduced slightly 
to 2.15. 

A third model (Model C) allows for heteroscedasticity at respondent level such that 
the uj; have variances at level 2 which are specific for the social class C,; of the subjects. 
Since Cj has three levels there are three possible values for oF. So log (;), namely the 
logs of the level 1 precisions 6; = oF, could depend upon social class via a regression 
containing an intercept and effects for the second and third social class groups. Alter- 
natively, we here adopt the strategy in Equation (4.8), with 


by aus G(a,, bı), > a Gla, bo), 3 D G(a3, b3) 


and a; = b; = 0.0001 for all i. 

The analysis is based on three parallel chains over 5000 iterations, and shows that the 
unskilled working class respondents as most homogenous (least variable) in their views 
over districts. The direct regression effect of social class on attitudes loses further 
definition as against Models A and B, with the posterior standard deviation for the 
lower working class now 50% larger than the posterior mean of —0.22. 


4.4 ROBUSTNESS IN MULTI-LEVEL MODELLING 


Issues of robustness in multi-level modelling occur especially in comparing results 
of fully Bayesian estimation and ML estimation involving generalised least squares 
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techniques (Langford and Lewis, 1998). The latter may condition on estimates of 
variance components which are treated as known, so obscuring the potential impact 
of outliers at different levels. Consider a standard two level hierarchical model with 
j=1,..J clusters at level 2, and i= 1,..n; cases within each cluster. Then for a 
continuous outcome Y, consider a level 1 model with cluster specific regression effects 


Yj =B XG uy (4.9) 


where the uj have mean 0 and variance o”, and P, and Xj; are of length p 1, with 
Xj; = 1. At level 2 each of the stochastic coefficients from level 1 are related to q cluster 
level predictors Z;. For the Ath such coefficient (h = 1,..p 1), we might have 


Bn=Ziy, enm j=1,..J (4.10) 


where y, is a vector of regression coefficients. 

Under typical assumptions that uj and ej are Normal, both the level 1 coefficients 
and the regression coefficients at level 2 may be sensitive to outliers. Estimates and 
inferences regarding the effects y, are especially subject to this if J is small. Thus, 
consider the posterior for y, under these standard assumptions, conditional on the 
variance g? at level 1, and on the (p 1)x(p 1) dispersion matrix V for the errors 
e at level 2, and assuming a uniform prior for y. Then the posterior is multivariate 
normal of dimension q x (p 1) 


yly, V, o cx N(G, D) 


where 


J 
D'=X WG VWI; 
j=l 


J 
G=D\ WG Vy'B, 
j=l 
G = XXN 
and where ĝ; is the least squares estimate of B,, namely 
By = GX) Xj 


Many multi-level procedures involve maximum likelihood or EM procedures for esti- 
mating V and ø? which are then treated as known. The resulting estimates of G and D, 
and hence y, will then not account for the uncertainty in estimating V and o°, so that the 
intervals on the components of y will tend to be too narrow (Seltzer, 1993). 

There may also be sensitivity regarding the assumed form of the cluster effect 
covariation and the observational errors u, even if uncertainty in their estimation is 
allowed for. Under the standard Normal assumptions regarding error terms at different 
levels, outlying data points, especially at level 2 and above, may unduly influence model 
parameter estimates and distort credible intervals tending to make them too wide. 
Options for robust estimation, especially of cluster disturbances, include discrete mix- 
tures (Rabe-Hesketh and Pickles, 1999) and Student ¢ errors with unknown degrees of 
freedom (Seltzer, 1993). 

Browne and Draper (2000) consider alternatives to the standard non-informative 
choices for the prior on the precision matrix of variable cluster effects (intercepts and 
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p Slopes varying at level 2 when the level 1 relates to subjects). Thus, for p 1 effects 
varying randomly at cluster level, a standard choice under multivariate Normal or 
Student ¢ sampling for such effects is that V~! is Wishart with p 1 degrees of freedom 
and an identity scale matrix. Browne and Draper consider an option whereby 
the degrees of freedom are (p 1) 2 and an estimated scale matrix (e.g. from an 
iterative GLS procedure) replaces the default identity matrix. Daniels and Kass (1999) 
consider further options, for instance where the degrees of freedom is an unknown 
parameter, or the element on the diagonal of the prior scale matrix is allowed to be a 
free parameter. 

It may be noted that individual membership probabilities in the discrete mixture 
model may be related to explanatory variates, in a way that parallels regressions of 
(continuously varying) cluster slopes and intercepts on cluster variables. Thus Carlin 
et al. (2000) consider robustness from a Bayes perspective in a multi-level model for 
panel data on smoking in adolescent subjects: such subjects are then ‘clusters’ at level 2. 
With a binary outcome, Y; at time ¢ for subject i, they contrast the ‘logistic-normal’ 
model, namely normal subject effects, with a two group mixture of subjects. Group 
membership is related to a set of fixed subject level covariates W;. In general, one might 
specify 


Yy ~ Bern(ry) 
logit(mi) = BL, Xir 

L; ~ Categorical(P;) 
logit(Pi1) = 6W; 


Here L; is the latent group membership aiming to distinguish between a high risk or 
susceptible group (in terms of smoking level) and a low risk group so P; is of dimension 2. 
In fact, Carlin et al. take the probability of smoking in one ‘non-susceptible’ group as 
zero, and only apply a logit regression for m; for subjects falling in a susceptible group. 


Example 4.5 JSP Project: Maths over time To illustrate sensitivity issues in the 
specification of cluster effect covariation, consider the Joint Schools Project data, also 
analysed by Mortimore et al. (1988), and more recently in an extensive sensitivity 
analysis by Browne and Draper (2000). Here the pupils are level 1 and the clusters are 
schools; the data include differential weights on pupils, which affect the specification of 
the variance term. The model considers Maths attainment at year 5 in relation to such 
attainment at year 3, with 


MATHS, =o, B;MATH3y uy (4.11) 


The attainment continuity effects B, and the intercepts are initially taken as bivariate 
normal over the 48 schools (Model A), with dispersion matrix V, while the level 1 errors 
uj are taken as normal with variance o°. To illustrate sensitivity with regard to prior 
assumptions on random cluster effects, we consider two variations towards robust 
inference. The first is a discrete mixture of intercept and attainment effects (Model B), 
with the a; and B, in Equation (4.11) being determined by a latent categorisation of the 
48 schools into M groups. Following Rabe-Hesketh and Pickles (1999), a discrete 
mixture with M = 3 groups of schools is assumed. The second is nonparametric hier- 
archical model using a truncted Dirichlet process prior (Ishwaran and Zarepour, 2000). 
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Estimation in Model A is based on three parallel chains with the summaries based on 
the last 24 000 from 25 000 iterations, since early convergence was apparent. In Model B 
there is later convergence and the last 15000 of 25 000 iterations (over three chains) are 
used for the summary. In Model C the summary is based on the second half of a 10000 
iteration run with three chains. 

The estimates for Model A show a negative correlation in the bivariate Normal for 
the school effects, averaging —0.46, between average school attainment and persistence 
in attainment. The persistence effect itself (the impact of Maths 3 scores on Maths 5 
scores) is demonstrated by an average coefficient B of 0.62, with 95% credible interval 
from 0.50 to 0.74. The school persistence effects, measured by posterior means, vary 
between 0.24 (school 37) and 1.20 (school 43). 

As to the discrete mixture model, this may be stated as 


MATHS, =a, B,,MATH3,; uy (4.12) 


where L; is the group membership of the school. Rabe-Hesketh and Pickles estimate 
group specific persistence coefficients B; = 0.68, B, = 0.77 and B} = 0.34. In their 
analysis, these groups have respective probabilities 0.54, 0.19 and 0.28. Similar findings 
are obtained here in the analysis of Model B, adopting a mildly informative Dirichlet 
prior on the mixture proportions for M = 3, with prior weights of 2.5 on each compon- 
ent. Also, the prior on the attainment effects B, in the three latent school groups is 
constrained to be increasing to improve identifiability (so the groups are in ascending 
order of persistence effect). With a relatively small number of schools at level 2, less 
informative priors may lead to identifiability problems unless a data dependent prior 
(see Chapter 2) is employed. 

The largest of the M = 3 attainment coefficients stands at B} = 0.80, and the smallest 
at 0.35 (Table 4.6). In estimation there was some delay in the convergence of the 
intercepts a in Equation (4.12). There were also high sampling autocorrelations for 
the group intercept parameters, especially those for the second and third groups (those 
with the larger persistence effects), suggesting the group intercepts are less well separ- 
ated than the persistence effects. Subsampling (every fiftieth iteration) makes little 


Table 4.6 Parameter summaries, Maths attainment 


Mean St. devn. 2.5% Median 97.5% 
Model A 
Log Likelihood —2603 6.6 —2618 —2603 —2592 
Corr(aj, B;) —0.46 0.16 —0.73 —0.47 —0.10 
o? 28.1 1.4 25.4 28.0 31.0 
Level 2 covariance 
Vil 4.45 1.40 2.28 4.25 7.72 
Vio —0.31 0.15 —0.66 —0.30 —0.06 
Va 0.10 0.03 0.06 0.10 0.17 
Average effects 
Intercept 30.59 0.37 29.86 30.60 31.31 


Math3 0.62 0.06 0.50 0.62 0.74 
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Table 4.6 (continued) 


155 


Mean St.devn. 2.5% Median 97.5% 
Model B 
Actual Log —2660 7 —2673 —2660 —2648 
Likelihood 
Complete Log —2613 5 —2625 —2612 —2604 
Likelihood 
Math3 effects 
Group 1 0.35 0.09 0.22 0.35 0.47 
Group 2 0.66 0.06 0.51 0.67 0.76 
Group 3 0.80 0.11 0.66 0.78 1.05 
Intercept 
Group 1 33.4 6.2 32.4 33.5 34.6 
Group 2 29.4 1.6 26.5 30.0 32.9 
Group 3 28.6 1.6 26.0 28.3 30.9 
Mixture Proportions 
TI 0.30 0.08 0.16 0.30 0.47 
T2 0.41 0.15 0.11 0.44 0.66 
T3 0.29 0.15 0.08 0.25 0.61 
Model C 
Log Likelihood —2609 7 —2625 —2609 —2597 
Corr(aj, B;) —0.71 0.15 —0.99 —0.71 —0.38 
o? 28.43 1.44 25.70 28.39 31.37 
Mixture Parameters 
k (Dirichlet 
precision) 3.16 3.22 0.29 2.00 12.03 
Average Number 
of Clusters 6.3 1.8 3 6 9 
Average effects 
Intercept 30.67 0.20 30.29 30.67 31.06 
Math3 0.62 0.03 0.55 0.62 0.69 


difference to the posterior means on the group intercepts, though the standard errors 


are altered to some degree. 


The likelihood under Model B deteriorates as compared to the bivariate Normal 
random effects model, though it is likely to be less heavily parameterised. The ‘complete 
data’ likelihood (assuming group memberships known) is more comparable to the 


bivariate Normal random effects model. 


The DPP nonparametric model assumes a maximum 10 clusters for the intercepts and 
slopes and adopts a baseline bivariate Normal prior for the covarying cluster intercepts 
and slopes. The baseline dispersion matric has prior as in model A. The Dirichlet 
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precision parameter k is updated as in Ishwaran and Zarepour (2000). Convergence is 
obtained in a two chain run after 1250 iterations with the mean number of clusters in a 
run of 5000 iterations being 6.3, and the Dirichlet parameter estimated at 3.16. The 
smoothed school effects on continuity (Table 4.7) are intermediate between the bivariate 
Normal parametric model and three-group discrete mixture. 


4.55 MULTI-LEVEL DATA ON MULTIVARIATE INDICES 


Frequently, profiling or performance rankings of public sector agencies (schools, 
hospitals, etc.) will involve multiple indicators. Inferences about relevant summary 
parameters such as comparative ranks, or the probability that a particular institution 
exceeds the average, are readily obtained under the Bayes sampling perspective 
(Deely and Smith, 1998). Such inferences will often be improved by allowing for the 
interplay between the indicators themselves, and also for features of the institutions (e.g. 
the case-mix of patients in health settings or intake ability of pupils in school compari- 
sons) which influence performance on some or all of the indicators used. Similar gains 
in precision may occur in small area health profiling, where multiple mortality or 
morbidity outcomes provide a firmer basis for defining health problem areas than a 
single outcome. 

Suppose individual level data Yj, are available for variables h = 1,.. H, clusters 
j=1,...J and subjects i = 1,..n; within each cluster. Then the measurements on the 
different variables can be envisaged as the lowest level (level 1) of the data hierarchy, 
in the same way as repeated measures on the same variable are treated. The subjects 
are at level 2, and the clusters (agencies, areas, etc.) at level 3. Thus for a set of 
H metric outcomes, taken to be multivariate Normal, and with M predictors 
Xiim, M = 1,.. M(X; = 1) at subject level, one might propose 


Yy ~ Ny (ny, x) 


Bin = Ba BroXi2 BasXy3  -- Baw Xi (4.13) 


with Yj = (Yj, Yi2,---, Yi) and = an H x H dispersion matrix’, 

In Equation (4.13) intercept and predictor effects are specific to dependent variable h, 
and do not pool strength over clusters. Another option, therefore, allows random 
variability over clusters, with 


Mijn = Bin = BinoXy2 Bin Xim 


The random effects Bm might then be related to cluster attributes within a multivariate 
density of dimension H x M, or separate densities (for each outcome A) of dimension M. 
In the same vein, multivariate data aggregated over individuals within clusters (i.e. 
averaged over the n; subjects) constitute a form of multi-level data. The lowest level of 
the analysis models the observed outcomes are vectors of length H for agencies, or more 
generally clusters, j = 1,..J 
Yj = Ors Vj YH) 


4 In BUGS this might be expressed in the code 
for (j in 1:J) {for (i in 1:n[j) {Y[ij,1:H] ~ dmnorm(mu{i,j, 1:H],T[1:H,1:H])}} 
where T[1:H,1:H] is the inverse dispersion. 
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Table 4.7 School effects on attainment continuity (B,, By,... Bug) 


Schoolno. Mean effect 2.5% 97.5% Mean effect 2.5% 97.5% Mean effect 2.5% 97.5% 


Model A Model B Model C 
S1 0.60 0.32 0.88 0.74 0.59 0.90 0.68 0.48 0.86 
S2 0.45 0.06 0.80 0.59 0.28 0.78 0.56 0.25 0.80 
S3 0.53 0.04 1.01 0.49 0.25 0.77 0.51 0.25 0.87 
S4 0.89 0.54 1.26 0.77 0.62 0.98 0.77 0.56 1.24 
S5 0.59 0.20 0.99 0.66 0.35 0.80 0.65 0.33 0.88 
S6 0.50 0.12 0.89 0.56 0.27 0.76 0.54 0.26 0.82 
S7 0.65 0.22 1.08 0.52 0.26 0.77 0.55 0.27 0.97 
S8 0.70 0.40 1.00 0.67 0.52 0.79 0.68 0.44 0.92 
S9 0.83 0.42 1.28 0.78 0.60 0.99 0.72 0.51 1.05 
S10 0.71 0.32 1.10 0.65 0.35 0.79 0.66 0.35 0.99 
S11 0.71 0.39 1.04 0.69 0.57 0.82 0.70 0.50 0.93 
S12 0.36 0.03 0.69 0.65 0.34 0.77 0.56 0.18 0.78 
S13 0.68 0.29 1.09 0.69 0.57 0.84 0.69 0.46 0.97 
S14 0.48 —0.03 0.98 0.50 0.25 0.80 0.50 0.23 0.84 
S15 0.82 0.42 1.22 0.70 0.58 0.85 0.74 0.53 1.20 
S16 0.77 0.29 1.25 0.65 0.32 0.87 0.70 0.32 1.27 
S17 0.65 0.30 1.00 0.68 0.55 0.82 0.68 0.43 0.89 
S18 0.44 0.06 0.83 0.45 0.24 0.74 0.44 0.23 0.75 
S19 0.67 0.18 1.18 0.68 0.34 0.88 0.67 0.32 1.04 
S20 0.67 0.33 1.01 0.79 0.61 0.99 0.70 0.48 0.95 
S21 0.77 0.36 1.19 0.74 0.59 0.94 0.73 0.52 1.08 
S22 0.98 0.67 1.31 0.72 0.60 0.91 0.86 0.61 1.38 
S23 0.37 0.03 0.71 0.35 0.23 0.48 0.37 0.22 0.52 
S24 0.30 0.00 0.59 0.36 0.23 0.50 0.36 0.21 0.53 
S25 0.44 0.09 0.79 0.51 0.26 0.75 0.48 0.24 0.76 
S26 0.75 0.34 1.19 0.67 0.40 0.80 0.70 0.38 1.10 
S27 0.74 0.31 1.14 0.74 0.60 0.93 0.72 0.52 1.05 
S28 0.73 0.30 1.17 0.78 0.60 0.99 0.71 0.46 1.00 
S29 0.51 0.15 0.88 0.56 0.29 0.76 0.53 0.26 0.81 
S30 0.47 0.19 0.75 0.35 0.23 0.48 0.37 0.23 0.54 
S31 0.63 0.33 0.92 0.67 0.47 0.79 0.65 0.40 0.84 
S32 0.60 0.36 0.83 0.69 0.57 0.81 0.66 0.48 0.82 
S33 0.46 0.10 0.83 0.35 0.23 0.48 0.37 0.22 0.54 
S34 1.00 0.55 1.44 0.73 0.59 0.95 0.91 0.57 1.64 
S35 0.41 0.05 0.78 0.35 0.23 0.48 0.37 0.22 0.54 
S36 0.49 0.11 0.88 0.58 0.28 0.77 0.55 0.26 0.82 
S37 0.24 —0.14 0.60 0.35 0.23 0.48 0.36 0.18 0.50 
S38 0.60 0.12 1.11 0.71 0.52 0.90 0.68 0.37 0.97 
S39 0.78 0.49 1.09 0.77 0.61 0.95 0.73 0.55 1.00 
S40 0.37 0.00 0.74 0.37 0.23 0.65 0.38 0.22 0.64 
S41 0.60 0.31 0.89 0.68 0.56 0.79 0.67 0.47 0.84 
S42 0.70 0.23 1.19 0.67 0.32 0.89 0.67 0.32 1.07 
S43 1.20 0.81 1.61 0.80 0.64 1.01 1.29 0.68 1.91 
S44 0.56 0.09 1.03 0.58 0.26 0.81 0.57 0.26 0.91 
S45 0.35 0.08 0.61 0.36 0.23 0.60 0.37 0.22 0.57 
S46 0.77 0.56 0.97 0.69 0.58 0.81 0.72 0.57 0.93 


S47 0.90 0.60 1.20 0.69 0.58 0.85 0.79 0.58 1.21 
S48 0.26 —0.16 0.64 0.38 0.23 0.68 0.38 0.19 0.68 
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conditional on parameters of a hyperprior. For example, for continuous data one might 
be able to assume multivariate normality 


y; ~ Nu(8;, Vj) (4.14) 


where 0; is a vector of true averages on the H outcomes. Often the measurement 
dispersion matrices V; will be known, and take account of varying sample sizes nj. 

A random effects model might then be used to predict the ‘true’ mean outcome 6j, on 
index h in agency j on the basis of a multivariate population hyperprior. This might 
involve cluster level covariates 


{zm k=1,..K,j=1,..J} 


with z; = 1 for all j, to predict the means in this density. Suppose, following Everson 
and Morris (2000), the 0;, are assumed to be functions of covariates z in a two stage 
multivariate model, with the V; in Equation (4.14) known. Then at stage 2 


0; ~ Ni(v;, X) (4.15) 
and the regression means vj, of the 0j, are specified as 


Va =Y  Y2Zj2 Y13733 +» Vi KZjK 
ViH = Vu. VH22j72 VH32j73 -++ VHKZjK 


Example 4.6 Hospital profiling In the hospital profiling application of Everson and 
Morris (2000), there are H = 2 percentage rates of patient-reported problems for J = 27 
hospitals: one rate relates to surgical issues, the other to non-surgical issues. There is a 
severity index Z; for each hospital, with higher levels of the index denoting more 
complex patient case-mix. Following Everson and Morris, a Normal approximation 
to the binomial is assumed and the measurement dispersion matrices V; are 
known. Further it is assumed that measurement errors in Equation (4.14) are uncorrel- 
ated (so that V; is diagonal), so restricting the modelling of correlations to the latent 
effects 0;. 


The variances, across hospitals and patients, for yj, and yj are taken as supplied by 
Everson and Morris (2000, p. 405), namely 148.9 and 490.6, so that the first stage model 
is 


ya ~ NO, Va) 
yjz ~ NO2, Vn) 


where V; = 149/n;, Vp = 491/n;. The latent means are modelled according to Equa- 
tion (4.15). 

Running three parallel chains for 50000 iterations, early convergence is apparent (at 
under 500 iterations) on the hyperdensity parameters and underlying means; however, 
sampling autocorrelations are quite high on the y parameters in Equation (4.15). 
However, comparing summaries based on 25000 and 50000 iterations shows little 
difference on posterior summaries for these parameters. Table 4.8, based on 50000 
iterations, shows that the impact of the severity index on the surgical problem rate for 
hospitals y» is clearly positive, but that on the non-surgical rate y,, is less clear. There 
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Table 4.8 Bivariate hospital outcomes analysis (h = 1, non-surgical; h = 2, 


surgical) 

Parameter Mean St. devn. 2.5% 97.5% 
covariance matrix 

Xi 3.78 1.79 1.21 8.29 
Xiz 2.43 1.61 0.07 6.17 
Dy) 2.84 2.31 0.34 9.05 


Intercepts and Severity effects 


Yi 12.29 1.23 9.96 14.97 
Y1 12.68 1.47 9.52 15.46 
yi2 1.76 2.34 —3.29 6.40 
Yn 5.68 2.80 0.24 11.46 


Smoothed rates (posterior means and standard deviations) 


Non-surgical Mean St. devn. Surgical Mean St. devn. 
81, 12.3 1.6 81,2 16.1 1.6 
02, 12.8 1.4 02,2 16.0 1.4 
03, 14.3 1.5 03,2 16.7 1.5 
04, 12.5 1.3 84, 2 14.1 1.4 
05, 13.3 1.4 05,2 17.5 1.7 
06, 12.7 1.3 06, 2 15.2 1.3 
87, 14.1 1.3 07,2 15.9 1.3 
Qs, 13.1 1.2 Og, 2 15.6 1.3 
89, 12.4 1.2 Qo, 2 14.1 1.3 
810, 10.2 1.3 810, 2 13.1 1.6 
ði, 16.7 1.4 O11, 2 18.6 2.0 
812, 12.5 1.2 O12, 2 14.5 1.3 
813, 12.6 1.2 813, 2 14.1 1.3 
O14, 14.0 1.2 014, 2 16.5 1.3 
815, 14.1 1.2 O15, 2 14.6 1.3 
816, 14.5 1.2 816, 2 17.1 1.3 
617, 11.7 1.2 017, 2 12.5 1.5 
818, 15.2 1.2 O18, 2 16.2 1.3 
819, 15.4 1.2 819, 2 17.1 1.3 
820, 11.7 1.1 020, 2 13.8 1.2 
81, 14.1 1.1 O21, 2 16.4 1.2 
829, 11.5 1.2 82, 2 15.6 1.5 
023, 14.7 1.1 023, 2 16.2 1.2 
O04, 12.8 0.9 O24, 2 16.0 1.2 
025, 11.5 1.0 025, 2 14.8 1.1 
026, 10.6 0.9 026, 2 11.5 1.3 
807, 13.8 0.8 027, 2 17.4 1.1 


is a clear positive correlation between the two latent rates 0; and taking account of this 
will improve precision in the estimated rates. Compared to the original surgical rate 
data, which range from 9-27%, the smoothed surgical problem rates vary from 
11.5-18.5%. 
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Taking account of the uncertainty in estimating © may mean that Bayes estimates are 
less precise than REML estimates that assume a known dispersion matrix. Everson and 
Morris present evidence that the REML estimates of 0;, show undercoverage in the 
sense of being too precise, and that the Bayes estimates show better coverage of the true 
distribution of these latent rates. 

On the other hand, Bayes estimates are more precise than the classical intervals based 
on no pooling, namely via fixed effects maximum likelihood: the classical intervals 
for 2 (surgical problem rates) are 51% wider than the Bayesian estimates obtained 
by Everson and Morris. Here variances on 6;2 are obtained which are intermediate 
between the REML results and the estimates obtained by Everson and Morris (2000, 
Figure 2). 


Example 4.7 Lung cancer death trends: bivariate Poisson outcomes This example uses 
data for 508 State Economic Areas in the US relating to male and female lung cancer 
deaths in the periods 1950-69 and 1970-94. The SEAs are nested within 51 states 
(including the District of Columbia) and eight regions. Of interest in explaining vari- 
ability in the latter period are the persistence of area mortality differences from the 
earlier period, the impact on cancer mortality of economic indicators, and mortality 
differences by state and regional location of the SEAs. As economic indicators average 
incomes per head in 1982 (relative to the US average) are used, and the possible lagged 
impact of the same variable defined for 1960 may also be considered. 


We define separate Poisson means for the two outcomes and relate them to state 
varying intercepts. Extensions to this model allow for the influence of past mortality 
rates and include regression slopes on the log(SMR) in 1950-69. The first model 
(Model A) for the correlation between the two SEA mortality outcomes in the later 
period is then (with = 1, 2 for males and females) 


Yin ~ Poilu) 


log (mn) = log (En) Bs, n 


where Ej, = expected deaths (from demographic standardisation) and S; = state to 
which SEA 7 belongs. This is a two-level model for each mortality outcome with 
intercepts varying according to the state that the SEA is located in. This arrangement 
of the data involves a vector S[] of length T, = 508 containing the state indicators for 
the SEAs. A nested arrangement might also be used, but would entail a square array of 
dimension [51, Q], where Q is the largest number of SEAs in a state. In a BUGS 
program the data would have to be padded out by NA values for states containing 
fewer than Q Economic Areas. 

It is assumed that the pairs of intercepts B; and Bj», j = 1, .. 51, are bivariate Normal 
with regression means v;, determined by state incomes, RING; expressed relative to the 
US average. Then 


Y= Y2RING; 
v2 =% Y2RINC; 


Estimates for this model (Table 4.9) are based on three chains taken to 5000 iterations, 
with 500 burn-in. They show a positive effect of income on male cancer levels — this 
might be seen as countering subject matter knowledge whereby relative economic 
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Table 4.9 Lung cancer deaths in state economic areas 


161 


Model A Mean St. devn. 2.5% 197.5% 
Correlation, male & female cancer rates 0.509 0.102 0.289 0.689 
Likelihood (model for males) —10910 5.058 —10920 —10900 
Likelihood (model for females) —8851 5.007 —8862 —8842 
Intercept Males (y,,) —0.559 0.182 —0.896 —0.204 
Income Effect Males (y12) 0.520 0.185 0.157 0.861 
Intercept females (y>,) 0.204 0.211 —0.224 0.572 
Income effect females (y>,) —0.247 0.214 —0.624 0.191 
Model B Mean St. devn. 2.5% '97.5% 
Intercept Males —0.078 0.165 —0.377 0.233 
Income Effect Males 0.095 0.170 —0.223 0.400 
Persistence Effect Males 0.618 0.058 0.506 0.726 
Intercept females 0.449 0.154 0.156 0.748 
Income effect females —0.443 0.155 —0.751 —0.149 
Persistence Effect females 0.360 0.045 0.270 0.450 
Correlation, male & female levels 0.238 0.129 —0.034 0.486 
Correlation, male levels and persistence —0.002 0.143 —0.287 0.287 
Correlation, female levels and persistence —0.034 0.140 —0.305 0.226 
Likelihood (model for males) —7481 6.135 —7495 —7470 
Likelihood (model for females) —5430 6.713 —5444 —5418 
Model C Mean St. devn. 12.5% 197.5% 
Region (Males) 

East —0.042 0.108 —0.226 0.126 
G. Lakes —0.042 0.109 —0.225 0.139 
C. North —0.064 0.119 —0.264 0.127 
S. East 0.036 0.100 —0.126 0.199 
S. West —0.068 0.112 —0.252 0.113 
Rocky Mts —0.143 0.114 —0.329 0.044 
Far West 0.072 0.110 —0.118 0.246 
Region (females) 

East —0.040 0.110 —0.219 0.136 
G. Lakes 0.004 0.115 —0.185 0.191 
C. North —0.084 0.119 —0.285 0.104 
S. East 0.116 0.102 —0.045 0.295 
S. West —0.035 0.119 —0.228 0.155 
Rocky Mts —0.144 0.112 —0.323 0.043 
Far West —0.049 0.112 —0.234 0.130 
Other parameters 

Intercept Males —0.162 0.242 —0.538 0.250 
Income Effect Males 0.203 0.233 —0.186 0.575 
Persistence Effect Males 0.612 0.060 0.517 0.710 


(continues) 
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Table 4.9 (continued) 


Intercept females 0.216 0.249 —0.239 0.580 


Income effect females —0.191 0.238 —0.532 0.231 
Persistence Effect females 0.357 0.047 0.278 0.436 
Correlation, male & female levels 0.171 0.151 —0.082 0.410 
Correlation, male levels and persistence 0.065 0.182 —0.240 0.362 
Correlation, female levels and persistence 0.127 0.152 —0.127 0.370 
Likelihood (model for males) —7481 6.995 —7493 —7470 

Likelihood (model for females) —5430 6.888 —5442 —5419 


hardship is associated with worse health and mortality. The two outcomes have a 
correlation of around 0.5 in this model. 

Two extended models (Models B and C) allow for persistence of mortality differences. 
Model B takes the form 


Yin ~ Poi(wy,) 


log (mn) = log (En) Bs, a,i Bs; h, 2Min 


where M;n is the maximum likelihood SMR for male and female lung cancer (h = 1, 2) 
in SEA i in 1950-69 in the earlier period. The parameters B; ;, therefore express 
continuity. 

For simplicity of notation, this model is written as 


log (m1) =log(En) bs„ı bs, 2Mii 
(4.16) 
log (mp) = log (En) bs„3 bs, 4Mni 


The state-specific regression effects are modelled as multivariate Normal of order 4 
(there are H = 2 outcomes and K = 2 predictors with cluster specific effects), in which 
parameters hs, ı and bs,, 3 (the mortality level parameters) are related to state incomes 
RINCs,. So for states j = 1, 51, the regression means are modelled as 


Vi,1= Vu Yi2RING; 


Vj,2 = Y21 í 
4.17) 
Vi, 3 = Y31 Y32RINC; 


Vj, 4 = Y4 


The coefficients y2; and y4; for bs,, 2 and bs,, 4 represent persistence effects. We evaluate 
this model, where the intercepts bs, ı and bs,, 3 depend only upon state incomes RING} 
in 1970-94, against an alternative which adds region in Equation (4.17); these are 
Models B and C, respectively. Note that examination of the three chain trace plots 
and Gelman—Rubin statistics for the log-likelihood and parameters shows convergence 
only after about 500-750 iterations for these models. 

By contrast to Model A, Model B shows a significant effect, in the parameter y3 in 
Equation (4.17), of state incomes on female but not male cancer. The highest female 
rates at SEA level are in the lowest income states, more in line with subject matter 
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knowledge. The average slopes y); and y4 relating later to earlier SMRs in Equation 
(4.16) are 0.62 for males and 0.36 for females, showing that male differences are more 
enduring. The two outcomes themselves are moderately correlated in this model 
(median 0.24 with credible interval from —0.03 to 0.49), but there is no correlation 
between the degree of cancer persistence and cancer levels. 

Adding region to Model B to explain differences in SEA mortality (in Model C with 
New England as base region) attenuates the income effects, expressed by y; and y3,, while 
clear region effects are mostly absent. The average likelihoods show no improvement over 
Model B and penalising for the extra 14 parameters would therefore show a clear deterior- 
ation in fit. There is a slight effect for the Rocky Mountain region (lower rates for both 
female and male mortality), but the credible interval is not confined to negative values. A 
weakly positive effect for the South East region is apparent on female mortality. 


4.6 SMALL DOMAIN ESTIMATION 


Survey data are often cost effective ways of obtaining information on a wide variety of 
topics and at frequent intervals in time. However, they may become sparse in terms 
of deriving social or health indicators for sub-populations defined by combinations of 
characteristics, generically termed ‘small areas’ or ‘domains’. For example, a domain 
might be defined by stratifying on demographic survey variables such as age, locality, 
sex and ethnicity. 

One may seek to make inferences about the average outcome in a domain based on a 
survey or some other sample, where this survey includes few units, perhaps none, from 
that domain itself. The direct survey estimates, even if design weighted, are likely to 
have low precision because sample sizes are not large enough at the domain level. Small 
area estimation describes a set of empirical and fully Bayes hierarchical regression 
procedures to combine survey information over similar small areas and to make 
inferences for the total domain populations (Ghosh and Rao, 1994). These methods 
may include ancillary information on the small areas from other sources (e.g. Census 
data) to improve the pooling. 

Thus, Folsom et al. (2000) describe how small area estimation is used in designing 
health promotion interventions by linking national and state survey outcome data (on 
disease and health behaviours) with local area predictors, such as non-survey indicators 
of social and age structure, in order to make local estimates of prevalence. Random 
effect small area estimation models have also been applied to discrete outcomes from a 
Bayesian pespective. For example, Nandram et al. (1999) report small area estimates of 
mortality and Malec et al. (1997) consider estimation of the probability of visiting a 
doctor in the last year, using data from the United States National Health Interview 
Survey. 

Let yix be a univariate response for subject i in class or domain k and cluster j, where 
j=1,..J,k=1,..K andi=1,..njy. The class might be defined by demographic attri- 
butes (e.g. sex, age band, or ethnic group) while the cluster might be a geographic area. 
Let x; denote a set of p categorical predictors assumed to be the same for all subjects in 
class k, regardless of cluster. Thus, if p = 2 and x; were sex (1 = male, 0 = female) and x2 
ethnicity (1 = non-white, 0 = white), then a white male would have xı = 1 and x, = 0. 
Also, let z; be q cluster level variables (e.g. average county or state incomes). Let 
P; = {Bj1.---B,,} denote varying slopes over clusters on the categoric variables. Then 
for a Gaussian outcome with random intercepts and slopes over all predictors, 
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Vijk = Qj B:xk Uijk 
J 


where uj ~ N(0, 0°). Setting 6; = {a;, P;}, then random variation in intercepts and 
slopes may be related to cluster variables z; via a multivariate normal model 


d= yz © e~ Np 1(0, 2) 
or equivalently, 
3; ~ Np 1 (yz, È) (4.18) 


with y of dimension (p 1) by q. 
Other possibilities include the random intercepts model (Moura and Holt, 1999; 
Battese et al., 1988): 


Vik =j BXE UK 


and models where domain and cluster are conflated into a single index j, and each 
survey unit is characterised by continuous predictors x, in a model such as (Hulting and 
Harville, 1991) 


Vp = OQ BG Uy (4.19) 


For a binary outcome (e.g. long term ill or not) defined by cluster and domain, we might 
have 


Vik ~ Bern(rijk) (4.20) 


logit(ma) =% B;xk (4.21) 


with the same model as in Equation (4.18) for pooling strength over clusters. The 
random intercepts model in this case (Farrell et al., 1997) is then 


logit(mx) = a Bxk 


The goal of small domain estimation is to make predictions of characteristics Y or 
proportions P for the entire population of a domain, namely for populations generically 
denoted N; for the total population of cluster j and domain k (e.g. white females in 
California). One might also wish to make predictions for aggregates of domains and/or 
clusters. Thus, if the binary outcome yj, denoted long term ill status, the domains were 
defined by age band, ethnicity and marital status, and the clusters were counties, one 
might be interested in estimating the proportion of all males who were long term ill in a 
particular set of counties. So if K’ denotes relevant domains, J’ denotes relevant 
clusters, then the estimate of the numerator of P combines 

(a) the known survey total in the relevant domains and clusters, 


Nik 


DDD 


jeJ! keK' i=1 


(b) a regression prediction for the population parts Rx = Nix — nj not included in the 
survey, obtained from a model such as Equations (4.20)-(4.21) with appropriately 
defined predictors. 

So if ji denotes a prediction of the binary outcome for the non-surveyed unit, the 

prediction of P is 
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Nik Rix 
`~ ~ r zat `~ `~ za TAK 
> » > Yik/ > > Nik > > > VYijk 
jeJ! keK' i=1 jeJ! keK' jeJ! keK' i=1 


In practice, the prediction for the non-surveyed population is likely to be at an aggre- 
gated level. An example of this approach, from a Bayesian prediction standpoint, is 
provided by Malec et al. (1997), who consider a binary outcome based on recent medical 
consultation or not. 

As an example of appropriately defined predictors for the non-survey prediction, 
suppose the survey model (4.19) involved a single continuous individual level regressor 
x; Such as income, and the goal was to predict the population wide cluster or domain 
mean Y;. Suppose X; were a cluster or domain wide average on the regression variable, 
obtained probably from other sources. Then the non-survey population prediction 
would be 

Y; = Bir B2% 
= yz;X; gX 
where 
B =y & E&Y N2(0, =) 


as above. Domain wide predictions for univariate continuous outcomes (perhaps using 
the simpler random intercepts model) are exemplified by small area models for per 
capita income (Fay and Herriott, 1979), and for county crop areas (Battese et al., 1988). 


Example 4.8 Economic participation This example takes a sub-sample of the data 
presented by Farrell et al. (1997) relating to economic participation among working age 
women. There were Ty = 72521 women in the 1990 Census sample used by Farrell et 
al., located in a set of J = 33 US states. A sub-sample of 10% of these data, amounting 
to T, = 7184 women, is taken. As in Farrell et al., a logit model for economic activity, 
namely working or being available to work, is proposed. The first two categorical 
predictors {x,, x2} in this model are marital status, namely married vs. otherwise, and 
presence of children under 18 vs. otherwise. We allow for state varying impacts of these 
factors and for homogenous effects of a third factor, x3, namely the woman’s age group 
(1 = 15—24, 2 = 25—34, 3 = 35—44, 4 = 45-54, 5= 55 ). The three factors may be 
taken to define classes, k = 1,..K (where K = 20), as above. Thus, in a multilevel 
pooling model (Model A in Program 4.7), the model for states j and classes k with the 
form 


Vik ~ Bern(trix) 
(4.22) 
logit(tx) =O; BiXe AWE 


as in Equations (4.20)-(4.21) above, with B; of dimension 2, and allowing the age group 
predictors to have constant effects \. Writing ò; = {a;, B;} we then have 6; ~ N3(A, V). 
The fixed effects A and à are assigned N(0, 1000) priors. (Note that in Program 4.8, a 
single string vector is used for the sample subjects, so that the actual program does not 
take this nested structure.) 
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Using the estimated parameters {8;, N} from applying the pooling model to the 
sub-sample, one may seek to predict the full population state economic activity rates 
P; — here the ‘population’ is all Ty = 72521 women. The population wide estimate 
combines actual sample numbers active (among the 7184) with regression predictions 
for the remaining 65 337 women. Specifically the model parameters {8,;, X} are applied 
to state wide proportions married, with dependent children and in the five age bands. 
The predictions for the non-sampled population are combined with the actual sample 
numbers active by state. The overall activity predictions for states may be compared 
with estimates based on the same logit model but without any pooling of strength over 
states. The latter option is a fixed effects or ‘no pooling’ model, and coded as in model B 
in Program 4.7. This keeps the state varying effects of marital and dependent children 
status as in Equation (4.22) but as fixed effects with N(0, 1000) priors. 

Convergence was rapid (under 500 iterations) for estimating Model A under three 
parallel chains, and summaries are based on iterations 500-2500. Under this pooling 
strength model the highest state level intercepts are in Maryland, Indiana and Massa- 
chusetts, while the least deterrence of married status on activity is in South Carolina, 
and the least deterrence of children in the household is in Florida and Washington State. 
The correlations r,, between the random effects show that deterrence from these two 
factors tends to be lower where activity rates themselves are higher (Table 4.10). 


Table 4.10 Random effect and parameter estimates (Model A) 


State Intercepts Effects of being Effects of children 


married 
Mean St. devn. Mean St. devn. Mean St. devn. 
Alabama 0.52 0.28 —1.33 0.29 —0.90 0.23 
Arkansas 0.67 0.33 —1.48 0.33 —1.20 0.29 
California 0.64 0.19 —1.52 0.19 —1.02 0.19 
Connecticut 1.17 0.32 —1.83 0.32 —1.00 0.26 
Florida 0.29 0.34 —1.32 0.31 —0.63 0.28 
Georgia 0.59 0.23 —1.17 0.26 —0.87 0.22 
Illinois 0.77 0.20 —1.53 0.20 —1.14 0.20 
Indiana 1.31 0.36 —1.68 0.34 —1.09 0.26 
Iowa 0.47 0.32 —1.55 0.30 —0.73 0.27 
Kansas 0.73 0.33 —1.60 0.33 —0.83 0.27 
Kentucky 0.52 0.29 —1.83 0.31 —0.96 0.28 
Louisiana 0.42 0.30 —1.52 0.30 —0.70 0.27 
Maryland 1.41 0.32 —1.67 0.30 —1.27 0.27 
Massachusetts 1.32 0.24 —2.09 0.24 —1.09 0.25 
Michigan 0.69 0.29 —1.62 0.27 —1.03 0.23 
Minnesota 1.02 0.29 —1.92 0.31 —0.88 0.26 
Mississippi 0.41 0.34 —1.37 0.30 —0.82 0.28 
Missouri 0.77 0.31 —1.60 0.26 —0.83 0.24 
Nebraska 0.77 0.33 —1.82 0.35 —0.79 0.26 
N. Jersey 1.20 0.24 —2.03 0.25 —0.94 0.23 
N. York 1.06 0.18 —1.88 0.17 —1.20 0.16 
N Carolina 0.90 0.24 —1.43 0.26 —0.81 0.22 


Ohio 0.90 0.21 —1.82 0.21 —0.89 0.18 


(continues) 
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Table 4.10 (continued) 


Oklahoma 0.92 0.29 —1.63 0.28 —0.80 0.25 


Oregon 0.91 0.40 —1.56 0.38 —0.92 0.30 

Pennsylvania 0.74 0.17 —1.97 0.19 —0.88 0.18 

S Carolina 0.44 0.29 —1.03 0.29 —0.81 0.28 

Tennessee 0.63 0.25 —1.28 0.24 —1.03 0.24 

Texas 0.73 0.19 —1.66 0.21 —1.11 0.17 

Virginia 0.82 0.26 —1.55 0.28 —1.16 0.23 

Washington 0.43 0.35 —1.54 0.31 —0.62 0.29 

West Virginia 0.31 0.32 —1.80 0.33 —1.02 0.29 

Wisconsin 0.98 0.25 —1.68 0.26 —0.89 0.24 
Mean St. devn. 2.5% 97.5% 

Mean Effects 

Intercept (1) 0.78 0.10 0.57 0.98 

Married (2) —1.63 0.10 —1.84 —1.43 

Children (3) —0.92 0.09 —1.09 —0.74 


Correlations between Effects 


T12 —0.41 0.21 —0.75 0.03 
T13 —0.34 0.23 —0.74 0.17 
T23 0.01 0.27 —0.49 0.51 


Let P; denote the actual state wide activity rate among all Ty = 72521 women and Ê; 
the prediction. Then Model A has average likelihood —3739 and sum XD; of absolute 
predicted relative deviations 


D; = |(P; — P))/P;| 


averaging 5.3. The effective parameter procedure (Chapter 2) shows there to be 64 
estimated parameters in Model A. This estimate is obtained from comparing the mean 
likelihood and the likelihood at the mean (the likelihood at the posterior mean is —3707). 
By contrast, the no pooling model involves 103 fixed effects; it has likelihood averaging 
—3742 and XD, averaging 7.2. So both in terms of predictive fit beyond the sample and 
penalised fit to actual sample data, the pooling model appears to be preferred. 

The prediction of full state population economic activity is only one of a wide range 
of predictions that could be made. One could also make predictions over combinations 
of the cluster categories and/or over combinations of the states (e.g. economic activity 
rates among women with children aged 25—44 in states grouped by region). 


4.7 REVIEW 


The motivations for multi-level analysis often combine both the need (cf. Chapter 2) to 
pool strength on estimates of unit level effects at level 2 or above (e.g. school exam 
success rates) and an interest, as in Chapter 3, in regression slopes for predictors at 
various levels. The most popularly used packages for multi-level analysis adopt an 
empirical Bayes methodology which may lead to understatement of the uncertainty in 
these various parameters. The fully Bayes approach has been succesfully applied to 
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multilevel discrete outcomes (e.g. Carlin et al., 2001), to multivariate multilevel out- 
comes (Thum, 1997) and to small area estimation (Ghosh and Rao, 1994). Other 
applications of Bayesian modelling extend to multilevel survival data and to structural 
equation models for multi-level data (Jedidi and Ansari, 2001). Whereas non-parametric 
analysis has figured prominently in certain areas of Bayesian application, applications 
of such approaches to multi-level data are less common. Another area for development 
is in analysis of multi-level survey data where the sampling process is informative, for 
example with inclusion probabilities proportional to the size of a school (Pfefferman et 
al., 2002). 
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EXERCISES 


1. In Example 4.2, fit Model A (as described in the text), namely random intercepts at 
area level but with Student ¢ form. First try the direct Student ¢ model with unknown 
degrees of freedom (df) using an appropriate prior (e.g. exponential) on df. Then try 
the scale mixture approach to the Student t, and assess whether any districts are clear 
‘outliers’ (if the degrees of freedom is unknown in the scale mixture method then a 
discrete prior on it may be used in BUGS). 


2. In Example 4.2, try a bivariate cluster effects model with random intercepts and 
slopes on social class IV and V. Is the correlation between varying intercepts and 
slopes clearly negative? 
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3. In Example 4.3 consider assessing heteroscedasticity at level 1 in terms of (a) the IQ 
categories of Table 4.3, and (b) both IQ category and gender. Is there a more 
economical IQ categorisation (by which level 1 variances are differentiated) which 
achieves comparable fit. 


4. In Example 4.4, assess the improvement in fit from Model A to model B after 
correcting for the extra parameters — for example, via the harmonic mean of likeli- 
hoods or via the DIC approach, after estimating the effective parameters in Model B. 
Also try a model where the District percent Protestant acts as a potentially context- 
ual variable, by making the impacts of individual Catholic or Protestant affiliation 
variable over Districts and dependent on this District variable (this is probably more 
simply done with homogenous level 2 variance). 


5. In Example 4.6, use the full measurement error covariance at stage 1 as as supplied 
by Everson and Morris, namely 


y= 148.9 140.4 
~ \ 140.4 490.6 


and re-estimate the model parameters. Set up a procedure to test whether any 
hospital has above average ‘true’ problem rates 0, on both outcomes, relative to 
the average true rates 0). 


6. In Example 4.7, try a bivariate Normal model for state intercepts (as in Model A) but 
including region effects in the linear predictor. Similarly try adding the 1960 income 
ratio to the contemporary income ratio. Are these models well identified, as assessed 
from posterior summaries, from signs of coefficients, and from the effective param- 
eters criterion (the DIC method discussed in Chapter 2) 


7. In Example 4.8, try fitting a fixed effects model (as in Model B), but with constant 
effects of marital and dependent children status, as opposed to state varying fixed 
effects. How does this compare to Model A in predicting the population wide state 
activity rates. 
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CHAPTER 5 


Models for Time Series 


5.1 INTRODUCTION 


In a variety of scientific disciplines we are faced with choosing appropriate models for 
representing series of observations generated in time, and for predicting the future 
evolution of the series. Often the series, although varying continuously in time, is 
observed at discrete intervals, t = 1,.., T. Many series are in fact averages over discrete 
intervals, or assessed at one point during them; thus many econometric series are 
monthly or quarterly, and stock price series are daily. 

The goals of time series models include smoothing an irregular series, forecasting 
series into the medium or long-term future, and causal modelling of variables moving in 
parallel through time. Time series analysis exploits the temporal dependencies both in 
the deterministic (regression) and stochastic (error) components of the model. In fact, 
dynamic regression models are defined when model components are indexed by time, 
and a lag appears on one or more of them in the model specification (Bauwens et al., 
1999). For instance, a dynamic structure on the exogenous variables leads to a distrib- 
uted lag model, and a dynamic structure may also be specified for the endogenous 
variables, the error terms, the variances of the error process, or the coefficients of the 
exogenous variables. 

While simple curve fitting (e.g. in terms of polynomials in time) may produce a good 
fit it does not facilitate prediction outside the sample and may be relatively heavily 
parameterised. By contrast, models accounting for the dependence of a quantity (or 
error) on its previous values may be both parsimonious and effective in prediction. The 
autoregressive models developed by Box and Jenkins (1976) and Zellner (1971) are often 
effective for forecasting purposes (see Section 5.2), but dynamic linear and varying 
coefficient models (Section 5.5) have perhaps greater flexibility in modelling non-sta- 
tionary series. 

Bayesian methods have been widely applied in time series contexts and have played a 
significant role in recent developments in error correction and stochastic volatility 
models (Sections 5.4 and 5.6). They have advantages in simplified estimation in situ- 
ations where non-standard distributions or nonlinear regression are more realistic. For 
example, in state space modelling the standard Kalman filtering methods rely on linear 
state transitions and Gaussian errors, whereas Bayes methods provide a simple ap- 
proach to include both continuous and discrete outcomes. In ARMA models a Bayesian 
perspective may facilitate approaches not limited to stationarity, so that stationarity 
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and non-stationarity are assessed as alternative models for the data series. A Bayes 
approach may also assist in analysis of shifts in time series, where likelihood methods 
may either be complex or inapplicable. Examples include the analysis of a permanent or 
temporary break point in a series (see Section 5.7), where a prior for the break point or 
points might be taken as uniform over a range (Tı, T2) within (1,7). 


5.2 AUTOREGRESSIVE AND MOVING AVERAGE MODELS UNDER 
STATIONARITY AND NON-STATIONARITY 


A starting point in dynamic regression models is often provided by considering dynamic 
structures in the outcomes. Autoregressive process models describe data driven depend- 
ence in an outcome over successive time points. For continuous data y,, observed at 
times ¢ = 1,.. T the simplest autoregressive dependence in the outcomes is of order 1, 
meaning values of y at time ¢ depend upon their immediate predecessor. Thus an, AR(1) 
model typically has the form 


Ve= Bh PY uw t=2,..T (5.1) 


where p represents the level of the outcome, and p models the autocorrelation between 
successive observations. After accounting for such observation driven serial depend- 
ence, the errors may (at least initially) be taken as exchangeable white noise and to 
follow, for example, a Normal density, u, ~ N(0, o°) with constant variance and preci- 
sion t = 1/o? across all time points t, and cov (us, ur) = 0. 

If the data are centred, then a simpler model may be estimated 


Yt = PY- ut (5.2) 


Additional dependence on lagged observations y;-2, i-3, - Yt-p leads to AR(2), AR(3), 
.. AR(p) processes. It may be noted that (5.1), if taken to apply to t = 1 also, implies 
reference to unobserved or latent data yo. If a prior on yo is included in the model 
specification, this leads to what is known as a full likelihood model. For an AR(p) 
model with times ¢ = 1,..p included there are p implicit latent values, yo, y_1,..y1—p in 
the full likelihood model. 

Classical estimation and forecasting with the AR(p) model rest on stationarity, which 
essentially means that the process generating the series is the same whenever observation 
starts: so the vectors (y1, . . yk) and (yr, .. -Yı k) have the same distribution for all ¢ and k. 
Specifically, the expectations E(y,) and covariances C(y;, y; k) are independent of t. For 
the stationary AR(p) model to be applicable, an observed data series may require initial 
transformation and differencing to eliminate trend. Non-stationary time series can often 
be transformed to stationarity by differencing of order d, typically first or second 
differencing at most. This may be combined with a scale transformation, e.g. 
Y, = log (yx). 

Using the B operator to denote a backward movement in time, a first difference 
(d= 1)in y, 


Zt = Yt — Yt-1 
may equivalently be written 
Z,=y,— By, = (1 — B)y, 
Then with z; now the outcome, the AR(1) model (5.2) becomes 
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Zt — PZt-1 = Ut 


or 
z,1 — pB) = u 


An AR(p) process in z; leads to a pth order polynomial in B, so that 
z- pB-pB -... P B’) = ur 
for which an alternative notation is 
p(B)z, = ur 


The process is stationary if the roots of p(B) lie outside the unit circle. For instance if 
p = 1, the series is stationary if |p,| < 1. 

In the AR(p) model, an outcome depends upon its past values and a random error 
or innovation term u;. If the impact of u, is in fact not fully absorbed in period ¢, there 
may be moving average dependence in the error term also. Thus, for centred data, the 
model 


Zt — Py Zr-1 = Ut — Oiu (5.3) 


defines a first order moving average MA(1) process in u, combined with AR(1) depend- 
ence in the data themselves. In BUGS, moving average effects in the u, for metric (e.g. 
Normal or Student #) data may require an additional measurement error term to be 
introduced, because the centred density adopted in BUGS for Normal and Student ¢ data 
assumes unstructured errors. In general an ARIMA(yp, d, q) model is defined by depend- 
ence up to lag p in the observations, by q lags in the error moving average, and by 
differencing the original observation (y,) d times. An ARMA( p, q) model in y;, therefore, 
retains the original data without differencing. In the general ARMA( p, q) representation, 


pP(B)y: = 0(B)u; (5.4) 


the process is stationary if the roots of p(B) lie outside the unit circle, and invertible if 
the roots of 0(B) lie outside the unit circle. For instance, if p = q = 1, the series is 
stationary and invertible if |p,| < 1 and |01| < 1. 

In a distributed lag regression predictors x,, and their lagged values, are introduced in 
addition to the lagged observations y;_1, y;-2, etc. A distributed lag model for centred 
data has the form 


y= Xo BmXt-m ut (5.5) 
m=0 


while a model with lags in both y and x may be called an Autoregressive Distributed Lag 
(ADL or ARDL) model (see Bauwens et al., 1999; Greene, 2000): 


p(B)y: = B(B)x;, u 
The latter form leads into recent model developments in terms of error correction 


models. 


Dependent Errors 
In the specifications above, the errors u, are assumed temporally uncorrelated 
with diagonal covariance matrix and autocorrelation is confined to the observations 
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themselves. However, if correlation exists between the errors then the covariance matrix 
is no longer diagonal. Let e; be correlated errors with 
y=a Bry & (5.6a) 
and suppose that an AR(p) transformation of the e; is required 
VBE, = u; 
in order that u, is unstructured with constant variance, where 
y(B) = 1 — yıB -yB -...y,B? 
A frequently occurring model is one with AR(1) errors ¢, such as 
Vie=a Bx & 
& = VEr1 U 
More generally, regression models may be defined with ARMAQ(p, q) errors 
Er — Y18&r-1 — YoEr-2- + — YpEr-p = Ur — OU; — OyU;_2.... — Ogttr_g (5.6b) 


To facilitate estimation, the AR(1) error model may be re-expressed in non-linear 
autoregressive form, for observations t > 1 subsequent to the first, and with homogen- 
OUS errors Ur, 


Ve=H Wirt aay Bx —yRx-1 ur 
= yy — Bx) ed-y) Bx u 


So the intercept in the original model is obtained by dividing the intercept in the 
transformed data model by 1 — y. 


(5.7) 


Multivariate series 
The above range of models may be extended to modelling multivariate dependence 
through time, with each series depending both on its own past and the past values of 
the other series. One advantage of simultaneously modelling several series is the possi- 
bility of pooling information to improve precision and out-of-sample forecasts. Vector 
autoregressive models have been used especially in economic forecasts for related units 
of observation, for example of employment in industry sectors or across regions, and of 
jointly dependent series (unemployment and production). 

For example, a time series of K centred metrical variables Y, = (Yir Vr- -YK is 
a multivariate Normal autoregression of order p, denoted VAR(p), if it follows the 
relations 


Y, = p Yı  ....®,)Vi-p Ui 
U, ~ Nx(O, V) 
where the matrices ®,;,..®,, are each K x K, and the covariance matrix is for 
exchangeable errors uir, uz, .., ugr. Then if K = 2, ®,; would consist of own-lag coeffi- 


cients relating Yı; and Y, to the lagged values Yı, ;-; and Y2, ; and cross-lag coeffi- 
cients relating Yj, to Y2,;-; and Yy to Yj, 1. 


(5.8) 


5.2.1 Specifying priors 


Bayesian time series applications with autoregressive and moving average components 
have included all the above modelling approaches, and have also been applied in models 
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combining state-space and classical time series concepts (Huerta and West, 1999). 
Among the questions that are involved in specifying priors for ARMA type model 
parameters are whether stationarity and invertibility constraints are taken, whether a 
full or conditional likelihood approach is used, and assumptions made about the 
innovation errors. As discussed in Chapter 1, prior elicitation consists in incorporating 
relevant background knowledge into the formulation of priors on parameters!. Often, 
relevant knowledge is limited, and diffuse or ‘just proper’ priors are called on. This 
raises questions of sensitivity to prior specifications, for instance on variances (Daniels, 
1999) or on time series assumptions (e.g. on stationarity or otherwise or on initial 
conditions), and the reader is encouraged to experiment with alternative priors in the 
worked examples of the chapter. 


Autoregressive and ARMA models without stationarity 
Consider first the autoregressive AR(p) model in the endogenous variable, 


p(B)y;, = uy 


Unlike classical approaches, a Bayesian analysis of the AR(p) model is not confined 
to stationary processes. As emphasized by Zellner (1971) a prior assumption of statio- 
narity in the AR(p) process may be regarded as a modelling assumption to be assessed, 
rather than a necessary restriction. Hence in an iterative sampling framework, an 
autoregressive model may be applied to observations y, without pre-differencing to 
eliminate trend, and the probability of stationarity assessed by the proportion 
of iterations s = 1,..,S where stationarity in the coefficients p® at iteration s actually 
held in terms of roots located outside the unit circle. A significant probability of non- 
stationarity would then imply the need for differencing, different error assumptions, or 
model elaboration, for example, to a higher order AR model (Naylor and Marriott, 
1996). 

One approach of Zellner (1971) without a stationarity constraint is to use a non- 
informative reference prior, such as Jeffrey’s prior, with 


T(P},---Pp, T) X 7! 

where t= 1/o*. In BUGS implementation of this approach would require direct 
sampling from the full conditionals. As with any non-informative prior, potential 
problems of identifiability may be increased, whereas identifiability generally improves 
as just proper or informative proper priors are adopted. 

The reference prior approach may be generalised to include AR(p) processes with 
normal-gamma conjugate priors (Broemeling and Cook, 1993). For example, with a 
gamma G(a, b) prior for 7, p|t is taken to be multivariate Normal N(r, to), where r is 
the prior mean on the lag coefficients, and Uo is a p x p positive definite matrix. A 
straightforward analysis is defined by conditioning the likelihood on the first p obser- 
vations Yı = {y1, y2,.--, Yp}, so avoiding the specification of a prior on the latent pre- 
series value. The likelihood then only relates to observations Y> = {Vp 1, Yp 25-++5 Yny- 
The conditional likelihood is then 


' In Bayesian econometrics, a distinction is sometimes made between two types of formal elicitation 
procedures, structural or predictive — structural methods involve assessment of the quantiles of the prior 
distribution of parameters, drawing on theoretical models or past experience (Bauwens et al., 1999; Kadane, 
1980). 
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Foe p, 1) x 0» eo -0s 5 pawi) (5.9) 


t=p 1 


The posterior is proportional to the product of the two priors and the conditional 
likelihood. 

Naylor and Marriott (1996) discuss a full likelihood analysis of the ARMA model 
without stationarity constraints by using proper but relatively ‘weak’ priors on the 
latent pre-series values Yo = (Vo, y-1, . -, Y1-p) and Eo = (uo, u_1,..,Wj—q). For instance, 
if the observed series is assumed Normal with mean p and conditional variance o°, 
Naylor and Marriott suggest the pre-series values Yo be taken as Student ¢ with low 
degrees of freedom, having the same mean as the main series but a variance larger by a 
factor k > 1, namely xo’. (This is equivalent to dividing the precision t by x.) If there 
are several pre-series values (when p > 1), a multivariate Student ¢ might be used. Note 
that Zellner (1971, p. 87) suggests a prior for yo that does not involve any of the 
parameters of the main model. 


Priors on error terms 

The assumption of white noise errors in the AR(p) model may need to be assessed with 
more general priors allowing for outlier measurements. However, interpretations of 
outlying points in time series are made cautiously. Outliers at time t may be clearly 
aberrant, and either excluded or replaced by interpolated values taking account of 
surrounding values (Diggle, 1990). On the other hand, especially in economic time 
series, they may reflect aspects of economic behaviour which should be included in 
the specification (Thomas, 1997). 

Some fairly conventional approaches are for a Normal mixture distribution or 
Student ¢ errors to replace the usual Normal error assumption (Hoek et al., 1995; 
West, 1996). Thus, let A be the small probability of an outlier (e.g. A = 0.05), and let 
the binary indicator 


J, ~ Bernoulli(A) 


govern whether the observation ¢ is an outlier. Then one alternative (West , 1996) to the 
Normal errors AR(1) model in (5.1) is 
Yte= M PY- Ut 


where u, ~ N(0, K;o7), and where random or fixed effect parameters K, > 1 inflate the 
variance when J; = 1. This is known as an innovation outlier model and may be written 
as 


u, ~ (1 — ANO, 6”) ANO, Ko’) (5.10) 


If the Student ¢ is used as an outlier model for the innovations then the most appropriate 
option is the scale mixture form — this includes weights w, averaging 1 which scale the 
precision — so low weights (e.g. under 0.5) indicate possible outliers. In models with 
autocorrelated errors, such as 


Ve=a Bx & 
Et — YEr] = Ut 


the innovation outlier model would apply to the uz. 
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One may also define additive outliers corresponding to shifts in the observation series 
that may not occur for all time points. So, following Barnett et al. (1996) one might 
define a model 

Vr=a Bx OF & 

Er — Yi Er-1 = Ut 
with o; ~ N(O, Ki,;07), where Kj; is either 0 or positive (corresponding to times when an 
additive outlier does or does not occur), and u, ~ N(0, K2,o7), where Kx is either 1 or 
greater than 1. One might then model the two outliers jointly, for instance via a discrete 
set of possible values for K; = {Kj,;, Kar}. Barnett et al. illustrate this with a prior for K, 
consisting of (0, 1), (3, 1), (10, 1), (0, 3), (0, 10) with selection among them based on a 
multinomial rather than binary indicator J;, but with prior probabilities possibly biased 
towards the null option (0, 1). For example, prior probabilities on the just named 
options might be (0.9, 0.025, 0.025, 0.025, 0.025). 

Another approach to additive outliers is developed by McCulloch and Tsay (1994). 
They consider first a random level-shift autoregressive (RLAR) model 


Vip, & 
Be = M ON, 
E = Vi E-1 Y2Er-2 sete uy 


where 6, is Bernoulli with probability A and governs the chance of a level shift at time t, 
the terms n, ~ N(0, &’) describe the shifts, the e; are autoregressive errors and the white 
noise errors u; are N(0, o°). The shift variance ° is taken as a large multiple (e.g. 10, or 
100) times the white noise variance o*. The probability of a shift A is beta with 
parameters favouring low probabilities, for instance A ~ Beta(5, 95). The above model 
may be re-expressed as 


V= By VO Be) WO2 H) Ue 
M= Mi Òm, 
McCulloch and Tsay also propose a specialised additive outlier model, namely 


Y5 Ot & 
0, = ÒM; 
Et = Y1 1-1 V2Er-2 er Ur 


which is the same as the RLAR model except that the level u,( = 0+) no longer depends 
upon its previous value. This model can be re-expressed as 

Y= 0r Yr- — 0-1) Y2r-2 — 0-2) . Ur 

Or = Òm; 
Another version of this model (Martin and Yohai, 1986) has 

y= (1 -òde öm, 

with e; again autoregressive. 
Priors consistent with stationarity and invertibility 


Prior assumptions regarding stationarity or non-stationarity (and invertibility or non- 
invertibility) may interrelate to other aspects of time series model specification. Thus, 
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specifying priors for a full likelihood involving all observations may be more straight- 
forward for a stationary model. Consider the AR(1) model 


YVtS M PY- Ut 
Then for a stationary process with p € [— 1, 1], and with exchangeable errors u, with 
mean 0 and variance o°, the first observation y; has mean p and conditional variance 


0/0 -p° (5.11) 
rather than ø?. This analytic form corresponds to assuming an infinite history for the 
process. For p > 2, a matrix generalisation of (5.11) is involved. For instance for p = 2 
with lag coefficients {p,, p2}, the equivalent of (5.11) is a bivariate Normal for 
Yı = {y1, y2} with covariance matrix o°}, where 


D=RSR KDK 
where K;(2) = (1, 0Y is a (2 x 1) vector, and 


—]|Pi P2 

s[i 8 
A prior constrained to stationarity might alternatively involve ensuring lag parameters 
in the acceptable region combined with a prior density on the pre-series values. Thus, 
for p = 1 the prior for the AR(1) model (5.2), the prior would be m(yo, p, 7), rather than 
T(p, T), while for p > 1 it would be necessary to specify priors on yo, y-1,..)i-p- 
Specifying priors on the pre-series values under stationarity may be avoided by back- 
casting (Ravinshanker and Pay, 1997, p. 182), a procedure which takes advantage of the 
fact that the model for a stationary time series taken forward in time also applies with 
time reversed (see Box and Jenkins, 1970). 

Some authors suggest simple rejection sampling, with samples of ARMA coefficients p 
or ð accepted (as a block) if they lie in the acceptable region (Chib and Greenberg, 1994). 
Another option on the priors on the AR coefficients P = {p,,..p,} consistent with 
stationarity, is reparameterisation of the p, in terms of the partial correlations r; of the 
AR(p) process (Marriott and Smith, 1992; Marriott et al., 1996). In the AR(p) model let 


P) — (AP) (p) pP) 


pP = (Pi > Ps 


with pP the jth coefficient. Then the stationarity condition is equivalent to restrictions 
that |r| < 1 for k = 1, 2,..p. The transformations linking priors on r; to the implied p; 
are for (k = 1,..p and i= 1,..k — 1) 


k 
Pk = Tk 
k k=l k=l 
pO = pe) = rep” 
So for p = 3 the transformations would be 


3 
ce 


(5.12) 


p 
3 2 2 2 ; 

pË = pP -rp = pP -rr (for k=3, i= 1) 
3 2 2 2 ; 

pf = pf — rsp? = 1 — rap (fork=3,i=2) 

Q) _ C9) 

1 =P 


p -mnp =r -rr (for k = 2, 7= 1) 


One may use U(— 1, 1) priors on the r; but alternative transformations have be pro- 
posed. Thus, let r;* be a normal or uniform draw on the real line, and then r; are given by 
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solving i = log([1 r;]/[l — 7]). Jones (1987) proposes beta variables Ti, P35 135+ +The 
and then transforming to the interval [—1,1] via rı = 2r{ — 1, ro = 25 — 1, etc. Priors on 
the 0 coefficients in Equation (5.4) consistent with invertibility may be obtained by a 
parallel reparameterisation, for k = 1,..q and i = 1,..k — 1 (Monahan, 1984; Marriott 
et al., 1996). 

Chib and Greenberg (1994) discuss priors for the ARMA(p, q) errors model (5.6) 
that are constrained to stationarity and invertibility. They emphasise that nonstatio- 
narity in the data themselves may still be modelled by unrestricted coefficients p on 
lagged y;. 


Further aspects of priors on regression coefficients 

If a stationarity constraint is not imposed then a flat prior may be chosen for the 
P = {p,,--p,} but proper priors, such as p; ~ N(0, 1), j= 1,..,p provide a relatively 
vague but proper alternative. To take priors on the p; that had larger variances would 
neglect the typical pattern of autoregressive coefficients on the endogenous variable, 
with values exceeding 1 being uncommon, except in short term explosive series. More 
specialised priors on the coefficients on lagged endogenous or exogenous variables may 
be applied in particular forms of model. In vector autoregressions, Litterman (1986) 
specifies a prior mean of unity for the first own-lag coefficient (relating yj; to yx, 1), 
but zero prior means for subsequent own-lag coefficients, and for all cross variable lags. 
The form of the own first lag prior follows the observed trend behaviour of many 
undifferenced economic variables in approximating a random walk, namely 


Vkt = Vk, t-1 Ukt BED KetH9 og F 


In distributed lag models, Akaike (1986) discusses smoothness priors, namely Normal 
or Student ¢ priors for differences of order d in the distributed lag coefficients B,,,. For 
example, a first order smoothness prior would specify B, as N(B,,_1, or) or equivalently 
An = Bm T, Bm-1 a N(O, o). 

However, the prior on autoregressive, moving average or regression parameters is 
expressed, model selection may be developed by procedures such as those of George and 
McCulloch (1993) or Kuo and Mallick (1994), and discussed in Chapter 3. In this way, 
alternative models defined by inclusion or exclusion of coefficients may be evaluated. 
For instance, if J, = 1 or J, = 0 according as the kth autoregressive coefficient p, is 
included or excluded, then one might set prior probabilities that J, = 1 which decline as 
k increases (Barnett et al., 1996). 


5.2.2 Further types of time dependence 


Other forms of time dependence in the y, may be combined with autoregression of y, on 
previous values of the series, while still assuming the errors are uncorrelated. This rather 
catholic approach to time series modelling is more likely when there is no necessary 
restriction to stationarity and the autoregressive dependence in the y; is just one among 
several modelling options. Thus one might consider modelling trend without necessarily 
differencing beforehand. However, as well as simple trends in ¢ or log(t), there are often 
periodic fluctuations in time series such that a series of length T contains K cycles (timed 
from peak to peak or trough to trough). So the frequency or number of cycles per unit 
of time, is 


f=K/T 
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An appropriate model for a series with a single cycle is then 
yı = Acos (2mnft P) u 


where A is the amplitude and P the phase of the cycle, and period 1/f, namely the 
number of time units from peak to peak. To allow for several (r) frequencies operating 
simultaneously in the same data, the preceding may be generalised to 


y=) Ajcos(Qnfjit Pi) u 
j=l 


For stationarity to apply, the A; may be taken as uncorrelated with mean 0 and the P; as 
uniform on (0, 277). Because 


cos (2mf;t P;) = cos (2mf;t) cos (P;) — sin (27,2) sin (P;) 


the model is equivalently written 


v= So fay cos Of) Bj sin Qah} Uy (5.13) 
j=1 


J 


where a; = A; cos P;, B; = —A; sin Pj. 


5.2.3 Formal tests of stationarity in the AR(1) model 


There is a wide literature on the question of trend stationarity of the outcome y; in the 
AR(1) model for, with non-correlated Normal errors uz: 


Yr= he PVH-1 Ut (5.14a) 


where u; ~ N(0, o°). If |p| < 1 in this model, then the process is stationary with a 
variance o7/(1 — p°) and long run mean 


He = w/(1 — p) (5.14b) 


as can be seen by taking expectations in Equation (5.14a). If |p| < 1 the series will tend 
to revert to its mean level after undergoing a shock. If, however, p = 1, the process is a 
nonstationary random walk with its mean and variance undefined by parameters in 
Equation (5.14). 

Tests for nonstationarity may therefore compare the simple null hypothesis Ho: p = 1 
with the composite alternative Hı: |p| < 1 and classical tests of nonstationarity revolve 
around this type of unit root test. If the hypothesis p = 1 is not rejected, then this implies 
that the differences Ay, = y; — y,-1 are stationary (this is known as difference statio- 
narity as opposed to trend stationarity in the undifferenced outcome). 

The simple model (5.14) may be extended (Schotman and van Dijk, 1991) to suit the 
observed series being considered by adding trends in ¢ (e.g. linear growth) and lags in 
Ay;. These modifications are intended to improve specification and ensure that the 
assumption that the errors u; are uncorrelated in fact pertains. As an example, Hoek 
et al. (1995, Equation (16)) consider the series of Nelson and Plosser (1982) in terms of 
an extended model 


Y= p pyr AL Ayer QAy-2 u (5.15) 
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where Af models a linear trend. Bauwens et al. (1999, p. 166) consider an alternative 
non-linear form of the AR model, also involving a trend in time At. For an AR(1) model 
this is expressed as 


(1 — pB); — u — AD) = u; 


which can be rewritten as 


Yr=pv-1 PA (l—pu M) u 


When lags in Ay, are introduced, an extended version of the nonlinear form would be 
specified by the model 


Ye= Py- PA (L—p)(w AD bAy-1 PAy-2 uM (5.16) 


Bauwens et al. report differences in the behaviour of nonlinear and linear versions of the 
AR model under nonstationarity or unit root situations. 

Bayesian tests of nonstationarity in the original AR(1) model (5.14) or in these 
extended versions may follow the classical procedure in testing explicitly for the simple 
hypothesis Hp: p = 1 versus the composite alternative |p| < 1. Thus Hoek et al. (1995) 
considers a prior for p confined to non-explosive values, but putting a mass of 0.5 on the 
unit root p = 1. Other values of p are uniformly distributed with mass 1/(2A) between 
[1-A, 1) where 1 > A > 0. For instance, taking A = 1 gives the prior 


m(p)=0.5 p=1 
m(p)=0.5 pe[0, 1) 


Bayesian tests of non-stationarity may also (Lubrano, 1995a) compare the composite 
alternatives 


Ho:p > 1 as against Hi:p < 1 


If there is genuinely explosive behaviour in the series, then artificially constraining the 
prior to exclude values of p over | may be inconsistent with other aspects of appropriate 
specification. The posterior probability that p > 1 is then a test for nonstationarity. 

In either approach, other considerations may be important. The first is the impact of 
outliers or nonconstant variance. Hoek et al. (1995) demonstrate that assuming Student 
t errors for the innovations provides robustness against outliers that cause a flatter 
estimate of p than the true value, and so lead to over-frequent rejection of non- 
stationarity. The second consideration is the impact of initial values — for example, 
the unobserved value yo in the AR(1) model (1). If p = 1, then this particular model may 
be expressed as 


II 
= 
4 
x 
S 


namely as an accumulation of innovations or shocks plus the initial value. Hence, 
conditioning on initial values (e.g. taking yı as fixed in the AR(1) model and so not 
referring to the unknown yo) may distort tests of nonstationarity. 
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5.2.4 Model assessment 


Time series fit measures often adopt cross-validatory principles as well as standard 
measures of fit to all the data points. Within sample fit measures include the DIC or BIC 
criteria, marginal likelihood approximations (Gelfand and Dey, 1994), or the minimum 
predictive loss criteria of Gelfand and Ghosh (1998) and Laud and Ibrahim (1995). Out- 
of-sample cross validation is demonstrated by Nandram and Petrucelli (1997), who use 
a predictive density fw; 1|y1, V2, - - Yr) for assessing one step ahead forecasts within the 
span t= 1,.., T — 1 of the observed series. 

If model comparisons involve selection of subsets of coefficients (e.g. by one or more 
binary inclusion indicators), then posterior vs. prior probabilities of inclusion are 
obtainable from standard MCMC output. As discussed above, model choice may be 
influenced by (or be sensitive to) features of the data or model specification such as 
outliers and assumptions about priors. Model checking (e.g. in assessing possible 
outliers by CPO statistics) may interact with procedures assessing sensitivity to priors 
—if for instance, robust error assumptions are adopted to lessen the effect of outliers. 


Example 5.1 Oxygen inhalation The first example is of an AR(p) model for the 
outcome y; without a prior stationary constraint, and with uncorrelated errors. Specif- 
ically, we consider undifferenced data on a burns patient reported by Broemeling and 
Cook (1993). The series consists of 30 readings y, of the volume of oxygen inhaled at 
two minute intervals (Figure 5.1)”. Broemeling and Cook adopt the AR(1) model in 
Equation (5.14) with the simplification provided by a conditional likelihood, taking the 
first observation yı as a known constant and avoiding reference to the initial condition, 
or latent value yo. They investigate the question of stationarity of the process, by 
sampling directly from the analytic form of the marginal posterior for p, and assessing 
the proportion of samples where |p| < 1. If stationarity is confirmed then it makes sense 
to derive the long run average m, in the AR(1) model of Equation (5.14). Taking 
+ = 1/o*, Broemeling and Cook adopt a flat prior on 0 = (u, p, T), namely 
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Figure 5.1 VO2 Readings 


? Data kindly provided by Peyton Cook. 
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TCO, T) xT! (5.17) 


Here two forms of just proper gamma prior for t are adopted, one t~G(0.001, 0.001) 
which approximates Equation (5.17), whereas the prior t~G(1, 0.001) approximates a 
uniform prior for the precision. The prior for wp is taken as N(300, 10°), weakly 
reflecting the average level of the observations, and p is taken as N(0, 1), consistent 
with a model not assuming a stationarity constraint. Finally, the fact that the outcome is 
positive means the Normal sampling for y is truncated below at zero. 

To implement this conditional likelihood method, three chains with dispersed? 
starting values are run for 20000 iterations, with convergence apparent from around 
iteration 1000. With the prior tT~G(0.001, 0.001), this gives posterior means for u and p 
of 287 (standard deviation 79) and 0.30 (s.d. 0.19), similar to those obtained by 
Broemeling and Cook. The probability of non-stationarity is negligible at 7E-4. With 
the alternative prior tT~G(1, 0.001), p is estimated at around 0.295 (s.d. 0.18) and 
posterior summaries are very similar for other parameters. 

The Durbin—Watson statistic (DW in Program 5.1) is used to assess auto-correlation 
in the residuals of the fitted model and is approximately 2 when there is no correlation. 
One might assess this (and hence the potential relevance of a model with correlated 
errors) by a step function in BUGS comparing DW with 2. Here it is clear that a 
conditional likelihood approach leaves no correlation in the errors, and the Durbin- 
Watson statistic has posterior average of 2.07, very close to 2. Another option to assess 
the validity of the white noise assumption for u+ is an analysis of the posterior residual 
estimates, i, = y; — );. For example, an effectively zero correlation parameter between 
i, and it;; would be consistent with white noise. 

In terms of fit criteria to compare against succeeding models, the conditional likeli- 
hood AR(1) approach yields, with t~G(1, 0.001), a CPO based marginal likelihood 
estimate of —155.8 (Gelfand and Dey, 1994), a Predictive Loss Criterion (PLC) of 
134200 (Gelfand and Ghosh, 1998), and a mean square prediction error (one step 
ahead) of 4680. The individual CPOs, when scaled to have maximum 1, show the lowest 
probability of belonging to the data for yi9, with a scaled CPO of 0.027. 

For a full likelihood approach without a stationary constraint (Model B in Program 
5.1), it is necessary to specify priors on the latent pre-series data or errors. Thus, take 
yo ~ (300, ko’, 2), where x is assigned a G(0.01, 0.01) prior, which has mean 1 but 
allows values of k much in excess of 1. The posterior mean (and s.d.) of p is then 0.24 
(0.19), and the posterior median of k around 10. The one step predictive error as 
compared to the conditional likelihood model (and defined over the same times as in 
that model) is worsened to 4785, and the predictive loss criterion is also higher at 
137100. On the other hand the marginal likelihood estimate (over cases 2-30) is 
virtually unchanged at —155.9. The pre-series value is estimated at 506 compared to 
yı = 485. 

A further elaboration (Model C in Program 5.1) allows for outliers, with a Normal 
mixture replacing the standard error assumption. Thus, 


us ~ (1 — A)N(0, o°) ANO, Ko’) 


where k > 1 inflates the variance for outliers. Here, A is set to 0.05, and a discrete prior 
with five values {5, 10, 15, 20, 25} is set on k. This prior is to ensure an outlier mixture as 
opposed to a more general mixture, where A might for instance be around 0.5, and k not 


3 These are null start values, and the values provided by the 2.5th and 97.5th percentiles of a trial run. 
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markedly greater than 1. The estimate of p is similar to Model B, with mean around 0.27 
(and s.d. 0.20). The highest outlier probability, around 20%, is for observation 19 (501). 
However, fit does not improve; for instance, the one step ahead predictive error is 5410. 

In a final analysis, Model D, the full likelihood model (Model B) is modified to 
include a coefficient selection indicator, consistent with the approach of Kuo and 
Mallick (1994). Thus 


YtF=V Jri Urt 


where J, is binary. The prior probability 7, that J, = 1 is taken as 0.9, so posterior values 
of my (based on counting samples where J, = 1) clearly lower than 0.9 would tend to cast 
doubt on the value of including an AR(1) coefficient. In fact, my is estimated around 0.85, 
so there does not seem clear evidence for or against including this parameter. 


Example 5.2 Nelson—Plosser series As an illustration of models for analysing trend as 
in Section 5.2.3, one of the 14 Nelson—Plosser series on the US economy (Nelson and 
Plosser, 1982) is considered. This is the real GNP series, taken up to 1988. Both linear 
and nonlinear AR models, as in Equations (5.15) and (5.16) respectively, are considered, 
and full likelihoods referring to latent pre-series values are assumed. The original series 
is shifted by subtracting the first observation from all observations, i.e. 
Zt = Yt Vi 
and this provides justification — when combined with low precisions — for assuming a 
prior mean of zero for the pre-series values, which are here Zo, z_1, Z—2. 

Student ¢ innovation errors via a scale mixture with v degrees of freedom and weights 
w, are adopted, since Hoek et al. (1995) report some of these series to be clearly heavy 
tailed. A gamma prior for the degrees of freedom is taken with sampling constrained to 
[1, 100], so encompassing both the Cauchy density (v = 1) and an effectively Normal 
density (v = 100). Hoek et al. (1995, p. 43) suggest varying the upper limit for the 
degrees of freedom as a sensitivity analysis. As in Example 5.1, a G(0.01, 0.01) prior is 
taken for the factor k, which scales up the main data variance in a way appropriate for 
the Student ¢ distributed pre-series latent data. 

Summaries are based on three chains with 20000 iterations and 3000 burn-in (to 
ensure convergence of v). The real GNP series is stationary according to the linear 
model, with zero probability that p > 1 (Table 5.1). However, GNP is marginally non- 
stationary (with the same probability standing at 0.09) under the nonlinear model 
(5.16). By comparison, Bauwens et al. (1999, p. 189) find probabilities that p > 1 of 
0.002 and 0.033 under the linear and nonlinear models. The linear model does, however, 
have slightly better one step-ahead prediction error, error under predictive loss, and 
marginal likelihood than the non-linear model. Figure 5.2 plots z; against the weights of 
the scale mixture: lower weights apply to observations at odds with the remainder of the 
data. The lowest weight is for the depression year 1932. 

As a third modelling approach, an additive outlier model (Section 5.2.1) is applied. 
This modifies the linear model in Equation (5.15) to the form 


Z= eB YWZ) AL PAZ- P2ÅZ-2 ur 
0t = ÒM; 


Following McCulloch and Tsay (1994), o, for years preceding the series are taken as 
zero. This model finds evidence of outliers in the years 1921, 1929, 1938 and, to a lesser 
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Table 5.1 GNP series (second halves of runs of 25000, iterations) 


Mean St. devn. 2.5% Median 97.5% 


Linear AR 

Prob of Nonstationarity 0 

u —0.035 0.020 —0.076  —0.035 0.001 
ò 0.006 0.001 0.003 0.005 0.009 
y 0.824 0.047 0.729 0.827 0.909 
O 0.374 0.106 0.163 0.375 0.579 
b> —0.007 0.109 —0.219  —0.008 0.208 
v 18.9 24.0 2.3 7.2 89.5 


Non-Linear 


Prob of Nonstationarity 0.029 

u —0.374 0.307 —0.991 —0.365 0.273 
8 0.032 0.004 0.022 0.032 0.038 
y 0.876 0.063 0.759 0.872 1.002 
Qb; 0.370 0.104 0.162 0.372 0.570 
b> —0.052 0.118 —0.286 —0.051 0.175 
K 9.2 19.8 1.0 3.2 59.2 

v 19.6 24.5 2.3 7.3 89.6 


Linear AR, Additive Outlier Model 


Prob of Nonstationarity 0.0 

u —0.031 0.020 —0.076  —0.029 0.004 
ò 0.005 0.002 0.002 0.005 0.009 
y 0.838 0.053 0.721 0.842 0.925 
ob; 0.426 0.106 0.214 0.426 0.632 
b> —0.046 0.110 —0.254  —0.048 0.178 


v 13.6 19.2 2.0 5.6 771 


degree, 1954. The outlier probability A is estimated at 0.051. The one step and predictive 
loss prediction errors of this model are better than either the linear or non-linear AR(1) 
models. 


Example 5.3 US coal production Moving average modelling for innovations com- 
bined with autoregression in a metric outcome, as in Equation (5.4), may be illustrated 
with data from Christensen (1989). He considers a time series y; of US coal production 
from 1920 to 1987 (T = 68), and its analysis for forecasting purposes by various 
ARIMA models. Because a broadly upward trend is apparent in the observed series 
he advocates first differencing, so that the analysis is of 


Zz =y fe t=1,..67 


It may be noted that a linear trend (e.g. the At term in Equation (5.15)) corresponds to a 
nonzero mean in the series of first differences. Figure 5.3 shows the original undiffer- 
enced series and the steady growth in output after 1960. Christensen concludes that 
(p, d, q)= (1, 1, 2) models with and without intercepts are the best fit, and that the version 
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Figure 5.2 Plot of differenced GNP series and scale mixture weights (w) 
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Figure 5.3 US coal 


with an intercept is more plausible as it better reflects the actual upward trend in the 


data when applied to making forecasts. 


However, to allow for the possibility of a model with a zero intercept, the first model 
adopted here uses a binary indicator J, ~ Bern(0.5) to select a zero or non-zero intercept 


across all periods. Hence, the model is 


Zt=J N PZ- Uy — Oiu — Ozur- (5.18) 
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The second model here proposes a threshold + within the estimation period such that 
I(t >t) = 1 and I(t < t) = 0. So after 7 there is a trend, but not before. Then 


Z= I(t>T)N pPZ-1 Uy — 01u — O22 


Christensen undertakes estimation with the series up to 1980 (T; = 61) and ‘extended’ 
forecasting beyond then, for the seven years up to 1987. Within the estimation series, 
one step ahead forecasts are obtained as 


Ynew,t 1 = Znew,t Yt 


where Znew, z is sampled from the model for the z,. Extended forecasts assume u, sampled 
from the same density as the innovations during the estimation period. Both sets of 
forecasts are compared with actual data via a squared error criterion. 

Because BUGS uses the centred Normal form, it is necessary to assume an additional 
measurement error e; to model the moving average terms such that, for the first model, 


Zt= JN PZ-1 Uy — iui — OU €r 


The variance of e; is taken to be less than that of the innovations u; and use beta prior is 
adopted for \ = var(e,)/var(u,). An extra parameter B ~ U(1, 1000) is introduced to 
guide the decomposition of variance so that 


à ~ Beta(1, B) 


On the AR and MA parameters, priors in line with stationarity and invertibility are 
assumed. An unconditional estimation, including all observations in the likelihood, is 
used, so that a latent (differenced) observation zo, and latent innovations uo and u_; are 
referred to. The latent innovations are assumed a priori to follow a Student ¢ density 
with v = 2 degrees of freedom and with variance five times greater than the main 
innovation series. zo is taken as a distinct fixed effect. 

Christensen obtains maximum likelihood estimates (with SEs) of 


à = 6.3(7.6), p = —0.42(0.23), 6; = —0.47(0.22), 05 = 0.38(0.16) 


and forecasts under ARIMA(2, 1, 2) intercept model as shown in Table 5.2. Forecasts 
under the no-intercept model fluctuate between 780 and 785 million short tons right 
through from 1981 to 1987. 

For the first model in Equation (5.18) a three chain run of 50000 iterations (5000 
burn-in) gives an estimate for J, around 0.0273. Hence the Bayes factor against an 
intercept is approximately 35. This model yields similar forecasts to those cited by 


Table 5.2 Threshold model forecasts, US coal production 


Year Mean St. devn. 2.5% Median 97.5% Actual Christensen 
(1989, p. 243) forecasts 
1981 808 54 701 807 915 818 787 
1982 827 80 671 827 986 834 796 
1983 844 92 667 843 1029 718 798 
1984 863 105 661 861 1078 892 804 
1985 881 117 657 879 1122 879 808 
1986 900 130 651 896 1170 886 812 


1987 918 142 647 914 1216 913 817 
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Christensen for the no intercept model. The prediction for 1987 is 778 with 95% credible 
interval (560,1010). 

The threshold model for the intercept (Model B in Program 5.3) has a considerably 
better predictive loss error within the series, and improved out of sample forecasts. It 
finds the threshold r to average 40, corresponding to the year 1959. The intercept \ then 
averages 22, which may be compared to the average increment z; from 1960-61 to 1979- 
80 of around 20.5 million tons. Under this model the lag parameter p and first MA lag 
6; are smaller absolutely than obtained by Christensen, with posterior means and SDs 


p = —0.20(0.22), 6; = —0.21(0.27), 62 = 0.45(0.21) 


This model predicts the 1987 production to be 918 million short tons (with 95% interval 
647 to 1216) against an actual 913. One might envisage other options for this series, such 
as a dynamic intercept for this model (see Section 5.5). 


Example 5.4 Investment levels by firms Maddala (1979) compares several procedures, 
including maximum likelihood, in the estimation of investment levels by a set of US 
firms. This analysis illustrates autocorrelated errors as in Equation (5.6). Maddala’s 
investigation derives from an earlier study by Grunfeld and Griliches (1960) on the 
validity of using aggregate data to draw inferences about micro-level economic func- 
tions (e.g. consumption and investment functions). The aggregate is defined by ten 
firms, while the autoregressive errors model is applied to the yearly investment time 
series (1935-54) of a particular firm, General Motors. The predictor series (C;, V;) is 
available from 1934. 


The model relates General Motors’ gross investment y; in year ¢ to lagged levels of the 
firm’s value V;_; and capital stock C,_1; thus the generic x, in Equation (5.6a) is here 
defined by x; = {V;_1, C;-1}. Maddala assumes AR(1) dependence in the errors leading 
to a specification for years 1936-54: 


Ve=Bo Bii B2C-1 & 
Et — YEr-1 = Ut 


with u; ~ N(0, t7!) being unstructured white noise. This model can be expressed in the 
form of (5.7), giving the model 


v= YW BoA-y) Bii YV) BolCH-1-yG-2) u (5.19) 


The first model (Model A) assumes stationary errors g, and a uniform prior on the AR 
parameter is assumed, namely p ~ U( — 1, 1). The model for the year 1935 (t = 1) can 
then be written 


Yı = Bo BiVı BoGer & 
E ~ N(0, 1/71) 


qi= —)r 


The prior for the intercept By is set to be appropriate for a series with values exceeding 
1000. One step ahead forecasts in Equation (5.6a) require knowledge of x; 1, and use 
current (years t) values of y, V and C. 

The posterior estimates of the parameters B; and B, (from a three chain run of 20000 
iterations with 1000 burn-in) are close to the maximum likelihood estimates cited by 
Maddala. They show both coefficients to be above zero, with B; and B, having means 
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and 95% credible interval 0.086 (0.045, 0.12) and 0.42 (0.28, 0.57), respectively. The 
autoregressive coefficient y is estimated to have mean 0.74 with 95% credible interval 
(0.34, 0.98). 

A second model (Model B) avoids assuming stationarity in the error process at the 
outset (Zellner and Tiao, 1964). Accordingly, the prior on y is not confined to absolute 
values under 1, and a Normal density with mean 0 and variance 1 is adopted instead. 
The model for the first observation (for the year 1935) now treats it as separate effect 


yi ~ NU, V1) 


with V; set large and mean, following Equation (5.19), of 


Bi = Bol- y) BiVi-1 BoC-1 M 


where M is a composite parameter represents the missing term y(yo — Bxo). This shows 
the posterior of 8; unchanged, but the mean effect of C,—1 is reduced, with 95% interval 
(—0.11, 0.68) now straddling zero. There is a 35% chance of y exceeding 1, and its mean 
is 0.93 with 97.5% point 1.28. The within series forecasts under this model are shown in 
Figure 5.4, and trace the observed series reasonably well. The value for 1953 is under- 
predicted and this feature is worse under the non-stationary model; the result is that the 
marginal likelihood based on the CPO statistics is lower by —1.7, and the pseudo Bayes 
factor on Model A as against B is 5.3. 


Example 5.5 Swedish unemployment and production: bivariate forecasts To illustrate 
VAR models involving autoregression in the endogenous variables, consider the bivari- 
ate series analysed by Kadiyala and Karlsson (1997). This consists of quarterly data on 
unemployment and industrial production in Sweden between 1964.1 and 1992.4, with 
the production index in logged form. The analysis follows Kadiyala and Karlsson in 
only considering the latter part of the series (here from 1978.1 onwards). A lag four 
model in each component of the bivariate outcome is adopted. It may be anticipated 
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Figure 5.4 Within series forecasts 
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that some of the cross-variable lags are negative, because unemployment increases as 
production falls. This model is in fact likely to be most effective in short term forecast- 
ing of unemployment, since influences on production (e.g. measures of consumer 
demand) are not included in the model. There are also seasonal influences on both 
variables and seasonal dummies are therefore included as exogenous predictors. Here 
the 52 quarters of the series from 1979.1 to 1991.4 are analysed (conditioning on 1978.1 
to 1978.4) and then predictions made for the remaining year. 


Kadiyala and Karlsson take the undifferenced series with N(1, 1) priors on the own 
first lags, and N(0, 1) priors on the other own-lag coefficients, and for all the cross 
variable lags. Experimentation here with an undifferenced series analysis showed poor 
identifiability and convergence on the cross-variable lag effects. Identifiability was 
improved by modelling aspects of trend, such as by polynomials in time to take account 
of the upward shift in unemployment in the early 1990s (Figure 5.5), following a decline 
during the 1980s. However, convergence was still problematic. 

By contrast, analysis of differenced series was less subject to such problems. For this 
analysis N(0, 1) priors are taken on all lags, including first own lags. There are no 
stationarity constraints. Priors on seasonal effects take the first quarter as reference 
(with coefficient zero). 

In Model A, all cross-variable lags (unemployment on production and vice versa) are 
included, and the ux: are taken as multivariate Normal. With a two chain run taken to 
10.000 iterations (and burn-in of 1000), the clearest defined lags are for unemployment 
on its own third and fourth lags, with posterior means (standard deviations) of 0.34 
(0.15) and 0.28 (0.16). All cross lags of unemployment on production are negative, with 
the mean of the second cross lag being around —1.20 (s.d. =0.78). Lags of production 
on unemployment are insignificant (close to zero). The cross-variable correlation in the 
errors ux, is estimated at around —0.15 (with 95% credible interval (—0.44, 0.18). 
Forecasts of unemployment in the short term match the rise during 1992 (Figure 5.6). 
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Figure 5.5 Undifferenced unemployment series 
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Unemployment Difference (from prev qrtr) 
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Figure 5.6 Unemployment in 1992 


In a reduced model (Model B), the lags of production on unemployment are set to 
zero. This produces an improvement in marginal likelihood (and in other fit criteria) 
and no great change in the pattern of remaining coefficients or the forecasts for 1992. 
In both models the clearest outlier is at 1982.4, with unemployment difference 
3.1 — 3.6 = —0.5. 


5.3 DISCRETE OUTCOMES 


For discrete outcomes, dependence on past observations and predictors may be handled 
by adapting metric variable methods within the appropriate regression link. Thus for 
Poisson outcomes 


Yı ~ Poi(p,) 
an AR(1) dependence on previous values in the series could be specified 
log (m,) = py-1 Bx (5.20) 


Here, non-stationarity or ‘explosive’ behaviour would be implied by p > 0 (Fahrmeir 
and Tutz, 2001, p. 244), and in an MCMC framework stationarity would be assessed by 
the proportion of iterations for which p was positive. Autoregressive errors lead to 
specification such as 


log (p) = Bx, & 
with 
Et = YEr-1 Ut 


for t > 1, and u; being white noise. 
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For binary data a similar framework would involve lags in the binary outcomes. 
Thus, if y; ~ Bern(m;), then one might have a regression model with a single lag in y; 
such as 


logit(m,) = Bx; = py-1 


For multi-category data with K categories there are K — 1 free category probabilities, 
and these might be related to lagged values on up to K — 1 dummy variables. This leads 
to models similar to VAR(p) models for multivariate metric outcomes, in that there are 
‘own’ and ‘cross’ lags. Thus, Pruscha (1993) outlines suggests the scheme 


yı ~ Categorical(P,) 


where both y, and P, are of dimension K, and the probability P, that the tth value of the 
series is in category k(k > 1) is a function of dummy indicators D1, x = 1 if Yi = k. 
The probabilities may be made functions of the previous response via a model 
such as 


P = exp (BkXt ProDi-1,2 ++ - PexDi-1, K)/ 
[1 5 exp (BkXrt PkoDri,2 «+ Pk, KDr1, K)] 
k 


and the lag coefficients model transitions between states. 

While the canonical links (e.g. log for Poisson data and logit for binomial data) are 
the most common choices, one may regard the link as a modelling area in its own right. 
For binomial data, the modelling of a choice between skewed alternatives to the logistic 
model, as in Lang (1999) was considered above (Section 3.3.4). Another methodology 
(Mallick and Gelfand, 1994) is applicable to modelling all kinds of link; for a Poisson 
with ‘base’ log link, it involves a transformation from the R! scale of n = log (u) to a 
(0, 1) scale in which beta mixture modelling can be carried out — for instance, 
Jm) = exp (n)/[1 exp (n)]. 

Modelling the link for count data is also possible in models involving multiplicative 
lags in a transformed version of y;, with mean specified as 


w: = exp BO" 
with possible transformations being, as in Cameron and Trivedi (1999), 
Vast c€ (ESO) (5.21a) 
or 
Yi1=max(c, y1) O<ec< 1) (5.21b) 


Either c would be an additional parameter or taken as a default value such as c = 0.5 or 
c = 1. A generalisation is to models such as 


(1,)" = exp xp)" 


Another approach, especially for binary and categorical time series, invokes an under- 
lying metric variable. This approach has been suggested for Poisson count data (van 
Ophem, 1999), but the most frequent application is for binary data using the method of 
Albert and Chib (1993) and Carlin and Polson (1992). Thus, for binary y;, a positive 
value of the latent series Y* corresponds to y; = 1 and negative values of the latent 
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series to y; = 0. One might then assume an underlying true series x;, largely free of 
measurement error, such that 


* 
Yi =x v 


Xt = PXr-1 Ut 


with |p| < 1 corresponding to stationarity. 


5.3.1 Autoregression on transformed outcome 


For Poisson or binomial data, it might be sensible that the lagged value of the outcome 
is in the same form as the transformed mean of the current outcome value. Thus a 
symmetry with the log link for Poisson counts would involve a lag in the log of a 
transformed version of y;_1, 


log (u,) = Bx, plog O) 


where y, is defined as above. 

If one wished to consider extended lags or moving average effects for frequent 
binomial events or counts, then unmodified ARMA methods — applied as if the 
outcomes were effectively metric, and using Normal approximations for the binomial 
or Poisson — may be appropriate. However, there are potential problems in applying 
standard ARMA models to count data since the assumption of Normality (or of any 
symmetric density) may not be appropriate, especially for rare events. 

If the autocorrelation is postulated in the regression errors, then a full model might 
take the form, for a Poisson outcome 


yr ~ Poi(p,,) 
log (m) = Bx &; 
Et = YEr] Ut 


However, a reduced parameterisation, excluding errors u, may still reproduce the 
essential aspects of the alternative form (5.7). Following Zeger and Qaqish (1988), one 
might propose, for a Poisson outcome, the model 


log (m) = Bx, yogy; — Bxr-1) 
with y,_; as in Equations (5.21a)-(5.21b). A lag two model would then be 
log (m) = Bx; yı(log yi — Bx,-1) Y2(log y!» — Bx;-2) 


and ‘moving average’ terms would compare log y,_; with log p,_; so that an ARMA(I, 1) 
type model would be 


log (m) = Bx; yUogy_;—Bx-1) O(log y_, — log p,_2) 


5.3.2 INAR models for counts 


Integer valued autoregressive (INAR) schemes are oriented to discrete outcomes, and 
have a close affinity with ARMA models for metric outcomes (McKenzie, 1986). In 
particular, their specification often includes devices to ensure stationarity (e.g. of the 
underlying mean count through time), though a Bayesian approach may make this 
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constraint less necessary. INAR schemes introduce dependence of the current count y; 
on previous counts y;_1, ¥;-2,--, and also allow an integer valued innovation series w; (a 
form of random shock). The autoregressive component of the model can be seen as a 
survival model to time ¢ for each particle in the previous overall counts y;_1, y;-2, etc. 

Thus, for an INAR(1) model, one considers the chance p that each of the y,_; 
particles survives through to the next period. If, say, y,_1 = 4, this amounts to conceiv- 
ing a Bernoulli model Bern(p) for continuation of each of the four particles. The 
autoregressive component of the INAR(1) model for y; is 


Yt-1 


C = se Bern(p) 
k=l 
Equivalently, C, is binomial with y,_; subjects, and p the probability of success. This 
approach to autoregression for non-negative integers is known as a binomial thinning 
operation, and is denoted as 


p°Yr-ı 


An INAR(2) process would refer to two preceding counts, y; and y;_2, and involve 
two survival probabilities, p; and p). Note that for an INAR(p) process, stationarity is 
defined by 


p 
2<! 
k=l 


(Cardinal et al., 1999). For overdispersed data, McKenzie (1986) suggested that the 
‘survival probabilities’, such as p,, in an INAR(1) model, be time varying, and one 
might then envisage autoregressive priors on these probabilities. 

If y;-1 = 0 then there is no first order lag autoregressive component in the model for 
yı. In BUGS the binomial can still be used when y; = 0, so one can code an INAR(1) 
model using the binomial, rather than program the full thinning operation. Note that an 
INAR model requires an initialising prior for the first value yı of the series, and 
McKenzie (1986) proposes 


yı ~ Poi(8) 


As well as survival of existing particles (which might have time-varying covariates 
attached to them so that pı, is modelled via logit regression), there is an ‘immigration’ 
process, analogous to the innovations of an ARMA model for a metric outcome. Thus, 
new cases w; are added to the ‘surviving’ cases from previous periods. McKenzie and 
others have envisaged y; as then being the summation of two separate Poisson processes 
with different means. Thus, McKenzie’s INAR(1) model has 


Ye =P Y Wi (5.22) 


where the mean of w; is Poisson with mean 6(1 — p) to ensure stationarity in the mean 
for y. 

One might also consider unconstrained Poisson densities for w, (not tied to p in an 
INAR(1) model), especially if there is overdispersion. Thus Franke and Seligmann 
(1993) propose a mixed Poisson with two possible means à; and M for w; in an analysis 
of epileptic seizure counts. Switching in the innovation process at time ¢ is determined 
by binary variables Q, (which may in turn be drawn from an overall beta density). 
Another option is to allow the mean of the w; to be time dependent. 
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One might also envisage (in terms of its compatibility with the BUGS computing 
environment) having a single mean p, for y;, but composed of the survival term p,;°v;-1, 
and an additional series w,, that follows a positive density (e.g. a gamma). For a 
stationary INAR(1) model, this leads to 


Yı ~ Poi(u,) 
m =p Y o (5.23) 
w: ~ G(bA(1 — p), b) 


5.3.3 Continuity parameter models 


Harvey (1989) and Ord et al. (1993) also propose a model for count series combining 
two sources of randomness. One concerns changes in the underlying level (as does the 
survival term of an INAR model) and the other refers to the distribution of observa- 
tions around that level. Thus, with Poisson sampling, the mean is itself gamma distrib- 
uted with time evolving parameters (a;, 5;), 


yi ~ Poi(u,) 


(5.24) 
M, ~ Glar, bi) 


The gamma parameters are related to previous parameters (a;—1, b;-1) via a common 
continuity parameter þ. This takes values between 0 and 1 that applies to both scale and 
index of the gamma. Thus 


ay = þar ı 
b, = pbi 


The initial values ag, bo are assigned a prior ensuring positive values (e.g. log-normal or 
gamma). To avoid improper priors for later time periods, one may modify Equation 
(5.25) by adding a small constant — indicating the minimum prior scale and index in the 
prior for the w,. This might be taken as an extra parameter or preset. Thus 


(5.25) 


ar = þa € 


bi= pbm c 


It is also possible to drop the constraint  < 1 and assess the probability that ¢ is in fact 
consistent with information loss (i.e. with accumulated discounting of past observations 
as  < 1 implies). 


5.3.4 Multiple discrete outcomes 
The approach of Equations (5.24)-(5.25) may be extended to multivariate count series, 
Yke, k = heh t= lT 


by modelling the total count at time t 
Y, = X yet 
k 


in the same way as a univariate count with parameter bı. The disaggregation to the 
individual series is modelled via a multinomial-Dirichlet model, with the evolution of 
the Dirichlet parameters governed by a second parameter Q3. 
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Jorgensen et al. (1999) also consider a method for multiple count series y1;, y2;,--VKr 
observed through time, but in terms of an underlying univariate latent factor 0, that 
accounts for much of their interdependency. The components of the multivariate series 
are conditionally independent given 0;. Thus 


Vee ~ Poi(vy,9;) (5.26) 


where vz; is a term dependent on regressors X; relevant to the kth outcome. The 
regression is typically applied via a log link: 


log (ver) = By Xx 


and B, differs by outcome. 
The series are envisaged as related (e.g. counts of diseases caused by similar risk 
factors), and the latent process 8, evolves as a gamma Markov process 


0, ~ G(cr, di) (5.27) 


The gamma density has the benefit of conjugacy with the Poisson as in the contin- 
uity parameter models above, and the evolution of {c;, dı} might follow the stationary 
scheme of Section 5.3.3. However, Jorgensen et al. (1999) propose a nonsta- 
tionary scheme, whereby 


ci = 8-1/0" (5.28a) 


d; = 1/[b,07] (5.28b) 


where g? is a variance parameter, and the mean of 0, given preceding values 00, . . 0;-1 is 
b,6,_-1. So, one may modify the impact of the preceding latent value by a regression of b; 
on additional covariates z,. To ensure b; positive, one may take 


log (b;) = az; (5.28c) 


Jorgensen suggests z; be defined in terms of first order differences z; = AZ, = Z; — Z;-1, 
where Z, are viewed as long term influences, and the X, as short term. 


Example 5.6 Meningococcal infection As an example of the INAR(p) process, 
consider a series of 104 counts in 28 day periods of meningococcal infections in Quebec. 
These span the period 1986-1993. The maximum observed count in any single period 
is six. As Cardinal et al. (CRL, 1999) note, the autocorrelation and partial autocorrela- 
tion functions of the (undifferenced) count series are virtually indistinguishable from 
a series of random shocks, except for a slight lag at five months. Using the model 
of Equation (5.22), but with w; as a general Poisson process (not linked to the p 
coefficients), CRL confirm the significant effect at lag 5. They find a Ljung—Box 
portmanteau statistic of fit with a non-significant vale (p = 0.63 at the maximum 
likelihood estimate). 


Here the specification in Equation (5.23) is adopted, and following CLR an INAR(5) 
model is estimated for the first seven years’ data (n = 91). The ‘autoregressive coeffi- 
cients’ p, are assigned a beta, B(1, 1), prior. For additional flexibility (Model A), the 
innovation error w, is modelled as an exponential density with changing parameter A,, 
where log A, is modelled as a first order Normally distributed random walk. An 
alternative assumption on the innovation errors is that they are independently gamma 
with parameters a; and a2 (Model B). Forecasts are made for the 13 periods of 1993 for 
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each model, though the coding in BUGS is not very elegant because stochastic quan- 
tities are not allowed as indices in sums or do-loops. 

As one check on fit, the Ljung—Box criterion is derived for L = 1, 2,...22 lags. 
A posterior predictive check (Gelman et al., 1995) is incorporated based on 
sampling new counts from the model mean and deriving a ‘new’ data Ljung- 
Box criterion. This is compared to the real data criterion, and the predictive check 
is based on the proportion of iterations where the new criterion improves on the real 
one. 

Posterior summaries from the second half of a three chain run of 5000 iterations 
show a similar impression of a lag at five months, but also more apparent lower 
lag effects (i.e. with lower standard deviations) than in the analysis of CRL. The 
posterior means (and s.d.) of the lag coefficients are p, = 0.15(0.09), p, = 0.09(0.07), 
p3 = 0.12(0.09), p4 = 0.15(0.10), and p; = 0.24(0.12). The minimum value of the Ljung- 
Box statistic is 7.3, roughly approximating the maximum likelihood value. The Ljung- 
Box statistic is approximately chi-square with L — p( = 17) degrees of freedom if the 
model is appropriate, so there is no evidence of lack of fit. The predictive criterion is 
close to 0.5 (around 0.58), and indicates the same. 

Model B produces similar estimates of the coefficients, and similar short term 
forecasts (Figure 5.7). Its marginal likelihood, namely —151.2, is slightly better 
than Model A at —152.4. The most suspect observation under both models is the 
seventh, with five cases, compared to noughts and ones in adjacent periods. 


Example 5.7 England vs. Scotland football series To illustrate the continuity param- 
eter model, consider the goals scored by England in the England vs. Scotland inter- 
national, which has been running since 1872. Excluding war years, there were a total of 
T = 104 matches up to 1987. A subset of this series (England away games) is analysed 
using the Poisson-gamma model of (24)-(25) in Harvey and Fernandes (1989). A 
uniform prior on the continuity parameter ¢ is assumed. Flat gamma priors are taken 


Forecast 


— Model B — Model A 


Figure 5.7 Forecast cases in 1993 
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on the initial conditions {ao, bo}. This defines Model A. As an alternative model for 
these data, an INAR(1) model is also considered. 


A three chain run of 5000 iterations on model A leads to a 95% interval for from 
0.91 to 0.99 with median 0.93. By contrast, Harvey and Fernandes obtain a value of 
0.844 for the 53 matches played away by England at Hampden Park. The forecast value 
for England’s score in the next match following the observed series (for year 
T 1= 105) is 1.73. The predictive probability distribution (obtained by extracting 
the iterations of y.new in Program 5.7) has 36% of the distribution being for zero goals. 

An illustration of the fit to the observed series is provide by matches 71-80, where 7 
and 9 goals were scored in the 72nd and 78th matches. These are the two clearest 
outliers using a CPO criterion. The 95% intervals do not include these exceptional 
scores, though the raised means of 2.3 and 3.1 reflect them (Figure 5.8). 

The INAR model has a slightly lower pseudo marginal likelihood for this example 
(—182.2 vs. —180.9 under Model A), and provides a lower (posterior mean) forecast for 
the next score, namely 1.56. The lag ‘coefficient’ p is estimated at 0.097 (s.d. 0.057). 


Example 5.8 Polio infections in the USA Fahrmeier and Tutz (1994) and others have 
analysed all or part of a time series y; of new polio infections per month in the USA 
between January 1970 and December 1987. A question often raised with these data is 
the existence or otherwise of a linear trend in time, after accounting for seasonal 
variations. These are represented by sine and cosine terms as in Equation (5.13), namely 


(a) cosine of annual periodicity, beginning with 1 in January 1970 (i.e. frequency 1/12); 

(b) sine of annual periodicity, beginning with 0 in January 1970; 

(c) cosine of semi-annual periodicity, beginning with 1 January 1970 (frequency 1/6); 
and 

(d) sine of semi-annual periodicity, beginning with 0 in January 1970. 


Actual and Fitted 


Match 
—— ModelB —— ModelA -—- Actual 


Figure 5.8 England vs. Scotland 
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We consider a linear trend model including these periodic impacts, and follow Fahr- 
meier and Tutz in assuming lags up to order 5 in the count itself, as in Equation (5.20). 
This model therefore has 11 parameters (intercept, trend, seasonals and lags) and 163 
time points. 

To illustrate forecasts with this approach, the model is fitted till the end of 1983 and 
forecasts made beyond then. Estimation is conditional on the first five observations, so 
that no model is required for the five latent data values (yo, y_1,..y_4). From a two 
chain run of 5000 iterations, with 


yı ~ Poilu) t= 6, 168 
the model coefficients (posterior means and standard deviations) are 


log (m) = —0.14 -—0.0032¢ — 0.22 cos (27t/12) — 0.47 sin (27t /12) 


(0.19) (0.0015) (0.11) (0.11) 
0.13 cos (21/6) — 0.37 sin (2mt/6) 0.084y, 1 0.038), 
(0.11) (0.11) (0.03) (0.04) 


—0.044y,3  0.028),-4 0.079), 
(0.04) (0.04) (0.03) 


The 95% credible interval for the time effect is confined to negative values, namely 
(—0.0062, —0.0003), and so supports a downward trend. By contrast, Fahrmeier and 
Tutz (2001, p. 254) find only a 9.5% probability that the maximum likelihood estimate 
of this parameter is significant. 

Some of the coefficients in the full model above are not well defined. A second option 
in Model A allows for coefficient selection or exclusion using the Kuo—Mallick method. 
Inclusion rates are found to be lowest for lags 3 and 4, though in fact are below 5% for 
all coefficients except the time term, the sine effects and the first lag in y. 

In the forecasting model (Model B in Program 5.8), the lag 3 and 4 coefficients are 
accordingly set to zero, though further model reductions might be made. The actual 
counts of cases beyond 1983 are generally 0 or 1, exceeding 1 in only 3 of the 48 months, 
and the forecast counts are all between 0.3 and 1.5 in line with the trend to lower 
incidence. 

As an illustration of the latent variable approach of Section 5.3.4 to these data, 
consider again the series from 1970 to 1983 (there is only one series so K = 1). Jorgensen 
et al. take Z, = t to model the trend demonstrated by the above analysis so that the 
‘long term’ model reduces to using a constant AZ, = Z, ı — Z, = 1, the coefficient of 
which, as in Equation (5.28c), represents the trend coefficient. The standard deviation o 
in Equation (5.28a) is assigned a uniform (0.001, 100) prior — see Model C in Program 
5.8. The initial parameter 0o is assigned a diffuse gamma prior. 

Convergence with a two chain run is achieved after 12000 iterations and from the 
subsequent 4000, the posterior mean of o averages around 0.25. The short-term effects 
are modelled as the coefficients of cos(2mt/12), sin(2mt/12), cos(2mt/6) and 
sin (27t/6), and estimates for these are similar to those reported in another analysis of 
these data by Chan and Ledolter (1995). 

The trend coefficient does not appear significant under the Poisson-gamma model of 
Section 5.3.4. Jorgensen et al. argue that lack of evidence for a trend is accounted for by 
the nonstationary nature of the latent process 0,, whereas Chan and Ledolter use a 
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stationary AR(1) process for the latent series. Specifically, the model of Chan and 
Ledolter (see Model D in Program 5.8) is 


Y, ~ Poi(p,) 


log (m) = BX, & 
E€ = VEr1 U 


u, ~ N(0, 07) 


In BUGS this alternative AR(1) model may be implemented using the centred Normal 
form for €,, namely £; ~ N(yé,-1, 02), and we obtain an estimate for y around 0.72. The 
trend coefficient, namely the coefficient on time within B.X;, is mostly confined with this 
model to negative values (consistent with a downward tend in polio cases), with 95% 
interval {—0.010, 0.002}. Comparison of Model D with Model C in terms of pseudo- 
marginal likelihoods gives a Bayes factor in favour of D of around 5. 


5.4 ERROR CORRECTION MODELS 


Time series regressions in economics and elsewhere are defined in statistical terms 
by questions of identifiability and parsimonious parameter choice, but also by their 
being sensible in terms of the substantive application. It is in this sense that apparently 
significant regression relationships involving non-stationary outcomes y; and predictors 
Xin X27,..-may in reality be subject to a spurious correlation problem. If one or 
more explanatory variable shows a distinct trend (i.e. is nonstationary), and so does 
the outcome, then an apparently significant relationship may occur even if y 
and the putative predictors are independent. Granger and Newbold (1974) considered 
spurious regression relationships between two nonstationary series y, and x;, separately 
generated by random walk processes with changing variances, while Nelson and 
Plosser (1982) show that many economic time series do appear to follow random 
walk error processes. If a series is a first order random walk (or effectively so 
with autocorrelation p, in Equation (5.1) of, say, 0.99) then its first difference is 
stationary. 

The analysis of Granger and Newbold was of ‘integrated’ series, namely those not 
differenced in an attempt to gain stationarity. Consider a regression for such series 


¥i=Bo Bix uw (5.29) 


where successive u; are taken to be non-correlated. Then taking first differences in y; or 
Y, = log(y,), and in the predictors, may reduce the chance of spurious regression 
findings, and is often the recommended method for modelling non-stationary variables. 
Under differencing one obtains the model 


Ay, = BAX, uy) — uni (5.30) 


which in fact defines a moving average error. 

Suppose, however, that Equation (5.29) represents a long-term equilibrium relation- 
ship (e.g. between consumption and income), with the errors representing temporary 
disequilibria. In this case differencing on both sides will be considering only the short 
run dynamics of the underlying relationship. If a long run association holds, then it is 
likely that changes in y will depend not only on changes in x, as assumed in Equation 
(5.30), but also on the long term relationship (5.29) between y and x, and on the 
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disequilibrium between y and x at lags t— 1, t — 2, etc. Moreover, estimates of trans- 
formed data models such as Equation (5.30) may in fact show significant intercept 
terms, implying that y continues to change, even if Ax, = 0. 

In practice, models such as Equation (5.29) may be inappropriate for representing 
long-term relationships, because of delays in the adjustment of y to changes in x. A 
model allowing lagged impacts of both y and x on the outcome is more likely to be 
observed, with form such as 


Yr= we OX, boxe. PY- u (5.31) 


This is known as an Autoregressive Distributed Lag model, denoted ADL or ARDL, of 
order (1, 1) (Bauwens et al., 1999, p 136; Greene, 2000, p. 724). To avoid problems 
raised by y; and x, being non-stationary, Equation (5.31) may be rearranged by first 
substracting y,_; from both sides and then subtracting b;x;_; (from b,x; and adding it 
to byx;_1) to yield 


Ay,=p bj Ax, (bi b2)xX-1-CU-p)yi-1 u 


Setting 8; = (bı b2)/(1 — p), and 8 = p/(1 — p), then gives a further restatement of 
Equation (5.31) as 


Ay, = Ax, (p—DOr-1-80-81%-1) ur (5.32) 


The term multiplied by a =p—1 may be regarded as modelling the negative or 
positive feedback of a disequilibrium error from the previous period, and so describes 
an error correction mechanism. So Equation (5.32) may incorporate the underlying 
long-term relationship more effectively than Equation (5.30), and may therefore per- 
form better in long-term forecasting. One motivation for using an ECM version of 
a model is to incorporate prior information, for example on economic multiplier effects 
of x on y which appear explicitly in the ECM version of an ADL model, but not in its 
original version as in Equation (5.31) (Bauwens et al., 1999). 

These ideas can be extended to the case where the hypothesised equilibrium relation- 
ship, analogous to Equation (5.29), involves two predictors: 


Ye = Bo Bix, Boz t (5.33) 


The typical observed short-run (disequilibrium) relationship, parallel to Equation 
(5.31), might then be 


Yı=bo bixı bxi CZ CZ- PY Ur 
Re-expressing in ECM form then leads to 
Ay; = biAx;, Az, (p> 1I)Q1-1 — 80 — ÒX — 22-1) Ur 


Cointegrated series 

One situation where error correction specifications are implied is when the outcome y, 
and predictor(s) are co-integrated. Co-integration is the statistical expression of a stable 
long run relationship between y; and a predictor x,, and may be defined in terms of the 
stationarity or otherwise of y, x and the error in models such as Equation (5.29) and 
(5.33). Thus, if a series y; follows a first order random walk, then its first difference is 
stationary, and y; may be termed an integrated process of order 1, denoted /(1); 
however, if y; is stationary without differencing, it is considered to be integrated of 
order zero, J(0). 
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Two nonstationary series y; and x, are cointegrated if they are both /(1) but the errors 
u;, as in Equation (5.29), 


ur = yt — Bo — Bix 
are stationary, that is (0). This is because if y and x are governed by a long-term 
relation then any disequilibrium error should hover around zero (Engle and Granger, 
1987). If, by contrast, y; and x; are not cointegrated, then the u, are J(1), and non- 
stationary. Whether or not the u; are stationary may be assessed by regressing u; on u;_1: 


ur = Mui Ny 


and if |A| < 1, then u is stationary. 


Example 5.9 Voting intentions Although economic relations form the major applica- 
tion for ECM models, they have also been applied to political affiliations and attitudes. 
Clarke et al. (1998) investigate changes in voting intentions for a future General 
Election in Britain, specifically the proportion intending to vote for the Labour Party 
in the five years preceding the May 1997 election victory for Labour. Thus for 64 
months from January 1992 to April 1997, the proportion intending to vote Labour 
was related to factors such as: 


(a) underlying Labour party identification (iden[] in Program 5.9); 

(b) the proportion of respondents seeing the Labour leader as the best potential prime 
minister (bpm[] in Program 5.9); 

(c) expectations of personal economic welfare (the ‘feel good’ factor denoted fg[ ]); 

(d) economic indicators such as unemployment and interest rates (denoted intr[ ]); 

(e) perceptions of the most important problem (inflation, unemployment, health, etc.). 
The proportions seeing unemployment and inflation as most important are denoted 
mipu[] and mipil]. 


Also used to explain voter intentions are short-term temporary issues or ‘events’, such as 
political scandals and interest rate crises. 

Clarke et al. argue that a long-term cointegrating relationship exists between Labour 
voting intentions, and factors (a) and (b) above, namely Labour party identification and 
seeing the Labour leader as best Prime Minister. Therefore it is necessary to introduce 
an error correction 


ECM, = LAB, — ò LBPM, — è&LPID, (5.34) 


into the voting intentions model. Their preferred model (Clarke et al., Table 3, Personal 
Prospective Model) selects from a fuller set of predictors and has the form (for two 
temporary events) 


ALAB; = By BiALBPM, B,LPID;2 oaECM;; B3AFG, 
B4AINTR, B;sAMIPI, B;AMIPU, (5.35) 
B7EVENT1, BgEVENT2, u 
The events are handled as short-term effects, and since the model is for differenced 
variables, a dummy is coded 1 in the month when the event occurs (or is assigned to 


have occurred), —1 in the next month, and 0 otherwise. Two events assumed likely to 
boost Labour general election voting intentions are included, namely the Conservative 
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election scandals of March 1997, and the European elections of June 1994 with a large 
Labour victory. (The model of Clarke et al. included several other events of this type.) 

The results obtained with this model (Table 5.3) are similar to those of Clarke et 
al. (1998) in terms of the parameters in the ECM itself, namely 6; and ò in Equation 
(5.34), and the negative sign on the ECM coefficient a in the main model Equa- 
tion (5.35). 

The relatively small size of a (Clarke et al. obtain a = —0.15) translates into a long 
‘half-life’ for the delayed negative feedback of factors included in the generic term x;_; 
in Equation (5.32). For example, a negative ‘event’ which reduced the Labour voting 
intention by three points in month ¢ would be carried over to a reduction of 
(3 — 3 x 0.197) = 2.4 points in the next month, (2.4 — 0.197 x 2.4) = 1.9 points the 
next month, and so on. 


5.5 DYNAMIC LINEAR MODELS AND TIME VARYING COEFFICIENTS 


Classical time series methods assume fixed relationships between y; and x; and statio- 
narity (absence of upward or downward trend) in y+, or a transformation of y,, so that 
both intercepts and regression coefficients are fixed. In practice, relationships between 
variables are likely to vary over time: for example, the response of the birth rate to 
economic conditions in successive years, or of sales to advertising over successive weeks, 
is unlikely to be constant whatever the level of birth rate or the economy. Autoregressive 
moving average models with fixed regression effects are often of limited use in under- 
standing the processes which generated such relationships, and may best be used as a 


Table 5.3 Labour voting intentions, parameter summary 


Variable Cointegrating Regression Clarke et al. 


Mean St. devn. 2.5% 97.5% MLE s.e. 


(MLE) 
Labour Leader Best Prime 81 0.637 0.074 0.493 0.777 0.635 0.083 
Minister 
Labour Party Identification 8» 0.583 0.069 0.448 0.717 0.585 0.078 

Error Correction Model 

Intercept Bo 0.264 0.285 —0.298 0.833 0.47 0.24 
Labour Leader Best Prime 
Minister Bi 0.463 0.074 0.318 0.613 0.27 0.08 
Labour Party Identification B, 0.158 0.092 —0.020 0.340 0.19 0.09 
ECM a —0.197 0.084 0.366 0.034 —0.15 0.07 
Feel Good Factor B3 —0.047 0.049 —0.142 0.054 —0.08 0.04 
Interest Rates Ba 0.866 0.930 —0.955 2.713 1.4 0.86 
Inflation Most Important 
Problem B5 0.343 0.150 0.044 0.629 0.55 0.14 
Unemployment Most 
Important Problem Be 0.071 0.054 —0.040 0.175 0.15 0.05 
Euro Election B7 3.416 1.400 0.712 6.099 4.73 1.33 


Conservation campaign 
scandals Bs 3.551 1.342 0.846 6.238 5.2 1.31 
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first approximation to summarise the data in a parsimonious way for forecasting 
purposes (Diggle, 1990). They may also have limitations in forecasting itself: a model 
with good fit in the sample period may perform poorly after that if the underlying 
parameters are in fact evolving through time (West and Harrison, 1989). 

To model stochastic shifts in regression parameters one may call upon random effects 
models which specify time dependence between successive parameter values in the form 
of smoothness priors. For example, whereas a fixed coefficient time series regression for 
univariate y might be 


ye=5a BY, Pyri u 
a model with nonstationary levels and B coefficient might be specified as 
Ve=O Br PY- Us 


with the parameters evolving according to 


QQ, = Ay] Wir (5.36a) 
B= Bi 2 (5.36b) 
P; = Pr-1 W37 (5.36c) 


with the ‘state’ equations in as, B, and p, defined for t= 2,..7. The most general 
dynamic linear model has errors u, and wx, with time dependent variances V, and 
W, = (Wir, War, W3z), respectively. 

In practice, the variances V, and W;, may be taken to be constant over all time points 
t=1,..,7, so that there is a single level of volatility describing evolution in levels or 
regression effects. In this example, then one might take 


oi, ~ NO, W1) 
wz ~ N(O, W2) 
3, ~ NO, W3) 


where W,, Wn and W3 are hyperparameters to be estimated. It is also possible to 
assume that the wx, follow a multivariate form, with a constant dispersion matrix. 

The system is set in motion by vague priors on the levels, B coefficients, and the 
autoregressive parameters p in the first period. For example, one might assume the first 
period regression parameter has a prior 


By ~ Ni, Ci) 


where bı and Cı are both known (typical values might be bı = 0, Cı = 1000). 

The priors on a; and B, in Equations (5.36a) and (5.36b) are first-order random walk 
priors and are equivalent to taking the differences a; — a,;_; and B, — B;_; to be random 
with variances W1, Wz and W3. One might instead consider higher order differences to 
be random, e.g. take No, = A(a; — a1) = a; — 2&1 ;_2 as having zero mean and 
variance W1. In this case, the prior on the levels is a second order random walk and can 
be written (for t > 3) as 


ar = 20;-1 — O72 Ir 


To accommodate changing volatility without excess parameterisation, Ameen and 
Harrison (1985) suggest a discounting prior for the variances, which specifies a prior 
for the precisions at time 1, and then discounts later precisions to reflect a time decay 
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factor. So for the observation error variance and the variance of the levels at time 1 one 
might take 


VT! ~ G(s, t1), Wal ~ G(x, t2) 
Subsequent precisions are downweighted by a factor 6, where 0 < 6 < 1, and for t > 1 
eas =ô] vi 
W,,' = ôW; 


Other approaches to stochastic variances involve ARCH-GARCH and structural shift 
models, and are discussed below. 


5.5.1 State space smoothing 


A major use for state-space models is in semi-parametric additive regression, where the 
object is to construct a smooth ‘signal’ f(t) from noisy data y(t). In this case, there are 
no covariates, and f(t) describes the smoothly changing level assumed to underlie the 
observations on y(t) which are subject to measurement error. Another common model 
task is to achieve a smooth representation of the changing nonlinear effect of a 
covariate. There is a close link (e.g. Fahrmeir and Lang, 2000) between state space 
models for dynamic general linear models and general additive models involving 
smooth functions of metric predictors, as discussed in Section 3.6. The orientation in 
this chapter is to smoothing of time series per se, together with non-linear regression 
analysis achieved via time varying regression coefficients. 

To illustrate techniques for smoothing time series, consider a univariate series y(t) 
observed at equidistant points, t = 1, 2, 3,.. T. Then the model has the form 


yYO=fO ed) 
The e(t) are typically taken as exchangeable errors such as e(t) ~ N(0, o°), but the true 
series f(t) follows a random walk of order k. For example, if k = 2, 


fO=4C-Y)-fE-2) ud (5.37) 


with u(t) ~ N(0, 77). One may expect the conditional variance 7 of the true series to be 
less than that of the noisy series y(r), with the noise to signal ratio A? = o? /7? then being 
greater than 1, and 1/)7 being under 1. So a prior (e.g. gamma) on 1/)7 might be taken 
that favours small positive values. Alternatively, one might take a uniform U(0, 1) prior 
on the ratio o?/[o? 7]. Higher values of A? correspond to greater smoothing (as the 
variance 7? of the smooth function becomes progressively smaller). 

With k = 2 and diffuse priors on f(1) and f(2) in Equation (5.37), and with Normal 
metrical data y(t), the posterior means f(1), f(2)...f(T) provide values which minimise 
the penalised fit function 


T T 
SOO -fOP XY wor 
t=1 


t=3 


T T 
S29 -fOyr XA uor 
t=1 t=3 

T 

{ 
t=1 


T 
= WO-fOP VS) Ew -2fe-1)-f¢- DY 
t=3 
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Higher order difference priors may be used for seasonal effects. For example, for 
quarterly data, a possible smoothness prior is 


h(t) = s(t) s(t—1) s(t—2) s(t— 3) ~ NO, 7.5) 
For monthly data, the analogous scheme is 
A(t)= s(t) s(t—1) s(t—2) .... s(t—11)~N(O, 7.5) 


Instead of simple random walk priors, autoregressive priors involving lag coefficients 
),..., bg may be specified as smoothness priors. For example, an AR(2) prior in the 
true series would be 


SONAS- Wft- 2), 7°) 


Kitagawa and Gersch (1996) illustrate the use of such priors (with high order k) to 
estimate the spectral distribution of a stationary time series. 


Example 5.10 Asymmetric series This example illustrates the detection of a signal in 
noisy data when the form of the signal is exactly known. Thus, Kitagawa and Gersch 
(1996, Chapter 4) simulate a time series according to the truncated and asymmetric form 


y(t) =f(t) et) t=1, 200 


where the true series or signal is 
f(t) = (12/7) exp (— {t — 130}7/2000) 


and e(t) ~ N(0, 0”), where o? = 1. The maximum value of the true series is just 
under 4 at t = 130, with the true series being effectively zero for t < 50. Kitagawa and 
Gersch (1996, p.109) contrast the AIC fit values obtained with different orders k in the 
random walk smoothness prior, and select k = 2 on the basis of the AIC criterion, so 
that 


FO = 4-1 -fE-2) u) 


with u(t) ~ N(0, 77). The k = 1 model is found by Kitagawa and Gersch to be too 
ragged, while the smoothing obtained with values k = 2, 3, 4 is visually indistinguish- 
able. 

Here two alternative priors are set on the variance of 7? conditional on o”, one a 
uniform prior U(0, 1) on o7/[o? 17], the other a G(0.1, 0.2) prior on t/o. The latter 
prior favours values under 1 in line with variability about the signal being expected to be 
less than that around the observations. A G(1, 0.001) prior on 1/o? is adopted. 

The median value of 1? obtained under the first prior, from the second half of a 
two chain run to 20 000 iterations, stands at 1.03E-4, as compared to the value of 0.79E- 
4 cited by Kitagawa and Gersch using a series generated by the same process. The 
median observational variance is estimated at 1.11. The true series is reproduced 
satisfactorily (Figure 5.9). This prior leads to convergence in under 5000 iterations. 
Other priors, whether gamma or uniform on the ratios 17/0? or t/o tend to converge 
much slower. The U(0, 1) prior on t/o takes 100 000 iterations to obtain o? around 1.1 
and a median on 7? of 0.6E-4, and provides a slightly better fit to the high values of the 
series. 

Convergence may be facilitated by direct sampling from one or perhaps both the full 
conditionals on Pı = 1/o7 and P = 1/7’, namely 
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4.5 


Value 


—— Actual —- Fitted (prior A) 


Figure 5.9 Actual and fitted smooth curve (asymmetric Gaussian) 


T 
Pi ~ Gla 0.5T,b; 0.55 > e0) 


t=1 


T 
Pa~ Gay OST- k), bi 0.55 PO) 
t=k 
where one might take a; = b; = 0.001, j = 1, 2. An exercise is to compare the fit of an 
RW(3) model with the RW(2) model using the pseudo-marginal likelihood or other 
model assessment approach. 


Example 5.11 UK coal consumption As an illustration of a smoothing model for a 
time series with seasonal effects, as well as a secular trend (decline) over time, we follow 
Harvey (1989) and Fruhwirth-Schatner (1994) in considering a UK coal consumption 
series C; of 108 quarterly observations from 1960-1986. The Basic Structural Model 
(BSM) of Harvey (1989) may then be applied, with a seasonal smoothing prior appro- 
priate to quarterly data. The outcome y; is the log of the original consumption series 
divided by 1000, namely 


Yı = log (C,/1000) 


Assuming Normality in this outcome, the model has the following components 


yr ~ N(x, 84) (Observation model) 

X= St (Underlying trend after allowing for seasonal effect) 
u; ~ Nui amı, 91) (Evolution in mean) 

arı ~ N(am2, 92) (Increments to mean) 


St ~ N( — S1 — 81-2 — St-3, 03) (Seasonality) 
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Fruhwirth-Schatnor considers the series for 1960-1982 and the utility of Gibbs sam- 
pling as against the Kalman forward filtering-backward sampling algorithm. Here, a 
uniform prior U(0, 1) is set on the ratio 01/04, and with the precisions 


0;', 651, 0;! 


assigned the diffuse G(0.01, 0.00001) priors adopted by Fruhwirth-Schatner. As an 
alternative model, consider a lag 1 autoregression in a;_;, namely 


arı ~ N(bay_2, 92) 


To assess fit, the predictive loss criterion of Laud and Ibrahim (1995) is used. Two chain 
runs of 75000 iterations on both models (with summaries based on the last 50000 
iterations) show a lower predictive loss on the autoregressive model, with variance 
component estimates as in Table 5.4. The density of p is concentrated on negative 
values, with a posterior mean of —0.29, but is right skewed with 97.5% point of 0.65. 
Figure 5.10 shows the actual and fitted series with the basic structural model. 


Example 5.12 Impact of TV advertising This example illustrates evolving regression 
impacts as in Equation (5.36b), and also involves a binomial outcome. Thus, a study on 
the impact TV advertising involved asking a set weekly total of 66 individuals a ‘yes or 
no’ question about an advert for a chocolate bar, and was continued over 171 weeks 
(Migon and Harrison, 1985). The number of positive answers r; in week ¢ is modelled as 
binomial with logit link to a single covariate, x, = weekly expenditure on advertise- 
ments. Thus 


rı ~ Bin(66, m+) 
logit(a,) = Q; Bx; 
with the parameters evolving according to 
Qt = Qr- Wa 
B: = B1 œ% 


RW(1) priors for a; and B, are adopted with a diagonal dispersion matrix with variance 
terms var (wa) = 02 and var (ws) = Op Priors for the initial conditions a; and B; are 
relatively diffuse, namely N(0, 1000). 


Table 5.4 Components of variance, coal consumption 


Basic Structural Model Mean St. devn. 2.5% Median 97.5% 


Variance of smooth series 0.00156 0.00138 0.00005 0.00118 0.00528 
Variance of trend 0.000018 0.000019 3E-6 0.000013 0.000070 
Seasonal Variance 0.00009 0.00014 0.00000 0.00004 0.00045 
Residual Variance 0.01544 0.00276 0.01049 0.01526 0.02135 


Autoregressive Model 


Variance of smooth series 0.00339 0.00191 0.00023 0.00316 0.00785 
Variance of trend 0.00110 0.00204 0.00001 0.00030 0.00640 
Seasonal Variance 0.00009 0.00016 0.00000 0.00004 0.00049 


Residual Variance 0.01336 0.00280 0.00842 0.01321 0.01922 
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Figure 5.10 Actual and predicted coal consumption 


With three chain runs of 15000 iterations, posterior mean estimates of o2 and OB are 
close to those cited by Fahrmeir and Tutz, though their variance for the slopes at 0.0002 
is smaller than the estimate here of around 0.00045. The plot of the parameters 
themselves (Figure 5.11) shows a decrease in the mean (i.e. awareness level) though 
the positive impact of advertising is more or less stable. The parameters in Figure 5.11 
are based on a run of 7500 iterations (2500 burn in) with every tenth iterate retained. 

An alternative model (Model B) with a multivariate Normal prior on a; and B, jointly 
yields very similar estimates for Op, and on the two sets of evolving parameters 
themselves. 
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Figure 5.11 TV advertising, intercepts and slopes 
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5.6 STOCHASTIC VARIANCES AND STOCHASTIC VOLATILITY 


There are many instances, including the dynamic coefficient models just discussed, 
where it may be necessary to model observed time series y, or make forecasts, when 
the variance is not fixed, but itself stochastic over time. Such situations are exemplified 
by stock price and exchange rate series where large forecast errors tend to occur in 
clusters, when the series are unsettled or rapidly changing. This is known as volatility 
clustering, and suggests dependence between successive values of the squared errors. In 
many applications of such models, the series has effectively a zero mean; for example, in 
many financial time series (e.g. exchange rates or price indices) the ratio of successive 
values E,/E,-; averages 1, and a series defined by the log of these ratios 
yı = log (E,/E;_1) will then average zero. 


5.6.1 ARCH and GARCH models 
Following Engle (1982), consider a time series regression 
Y= PBX € 

in which the error variances for subsets of the full period are different. One way to 
accommodate heteroscedasticity through time (i.e. changes in scale) is to introduce an 
exogenous variable z (or lagged values of the outcome), with the model now being 
perhaps 

V=Bxur z272: 


Forecasts of y now depend upon z,, as well as the preceding values of the outcome itself. 
Another option involves an autoregressive conditional heteroscedastic or ARCH model 
(Greene, 2000), such that 


Y= BX, & =BX, ryh (5.38) 


where the u, have mean zero and variance 1, and the h, depend upon previous squared 
errors 


h =a ae? (5.39) 


with both ag and a; positive. While the most usual assumption is u, ~ N(0, 1), Bauwens 
and Lubrano (1998) consider u; ~ Student(0, 1, v). 
Then E(e,|€,-1) = 0, and the conditional variance 


V, = var(é,|é-1) = E(e;|€:-1) = E(u; [ao aiez] 
2 
= Qo QIE] 


is heteroscedastic with respect to ¢,_;. Specifically, Equation (5.39) defines an ARCH(1) 
model, whereas an ARCH(2) model would involve dependence on ¢?_,. Thus 


2 2 
hi = ao QE, | Q2E; > 


If |a| < 1 in Equation (5.39), then the unconditional variance has mean zero and 
variance ag/(1 — a1). The log likelihood for the tth observation under Equation (5.38) is 


log L; = —0.5log 27) — 0.5 log h, — 0.587 /h, (5.40) 
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where e; = y: — Bx;. If y, has an effectively zero mean and the regression model does not 
involve predictors x;, then one may follow Engle (1982) and write 


Yt = urJh; (5.41) 
where for an ARCH(1) model 
h, = ap aiy? 
The log likelihood for the ¢th observation is then 
log L; = —0.5 log (27) — 0.5 log h, — 0.5y?/h; (5.42) 


In the GARCH model the conditional variance V, depends upon previous values of V, 
(or of h,), as well as on lags in ¢?. Whereas lags in e? are analogous to moving average 
errors in an ordinary ARMA time series, lags in V, are parallel to the autoregressive 
component (Greene, 2000). A GARCH(p, q) model involves a lag of order p in V, and 
one of order q in £?. Thus, a GARCH(1, 1) model would be either 


2 
hi =ao YV E 
or 
h =a h 2 
t=O Y-i QIE] 


where ao > 0. The likelihood is as in Equation (5.40). The stationarity conditions here 
are that a} y< 1, together with a; > 0, y > 0. The specification in Equation (5.41) 
leads to a GARCH(1, 1) model in which 


h=00 Wir aye, (5.43) 
with likelihood as in Equation (5.42). 


5.6.2 Stochastic volatility models 


Another option for modelling changing variances is known as stochastic volatility, and 
includes models within the state-space framework (Kitagawa and Gersch, 1996). Thus 
in 
Y= PBX & 

with e; ~ N(0, V,), one may assume the evolution of AK log V, follows a random walk 
process. For example, taking k = 1, and setting g; = log V;, gives a first order random 
walk which may follow a Normal or Student form: 

gi ~ N(gi-1, o?) 


An alternative stochastic volatility formulation (Pitt and Shepherd, 1998) involves 
autoregressive dependence in latent variables k,, which represent the evolving log 
variances. Thus for a series with zero mean and no regressors, one might specify first 
order dependence in the latent log variances (Harvey et al., 1994), with 


Yt = Ut exp (k:/2) (5.44) 
and 


K= k-i Ny 
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where the u, are N(0, 1) and n, ~ N(0, o2). If |b| < 1, then the k, are stationary with 
variance oł /(1 — p°). 

Another option is denoted the unobserved ARCH model (Shephard, 1996), in which 
an ARCH model still holds but is observed with error. This is generally classified as a 
stochastic volatility approach. For a zero mean observation series y and no covariates, a 
measurement error model combined with an ARCH model leads to 


ye ~ NOAs, o°) 
h, ~ NO, h) (5.45) 
hı = Qo aZ] 

To ensure h, is positive, ag and a; are constrained to be positive, and the further 


restriction 0 < a; < 1 ensures that the ARCH series is covariance stationary. If there 
were covariates or lags in the model for y, then the \, would be distributed as 


Ny ee N(p,; hi) 


where, for instance, p, = Bx;. 

For multivariate series (e.g. of several exchange rates) subject to volatility clustering, 
factor analysis type models have been proposed to model the interrelated volatility (Pitt 
and Shephard, 1999; Harvey et al., 1994). For instance, for two series y, k = 1, 2 and 
one factor f;, one might have 


yn =Bift on 
Yo=Roft on 


with f; and the ow, evolving in line with stochastic volatility. Thus 
fi~ NO, exp(k4)), on ~ NO, exp(k®!)), wn ~ N(0, exp(k®)). Then first order 
autoregressive dependence in the latent log variances would imply 


foe fod f 
Ky SP Ki Ny 
wl _ wl æl wl 
K =P Ka M 
w2 _ w2 w2 w2 
K =P Ka M 


Example 5.13 Spot market index To illustrate in a comparative fashion both ARCH 
models and stochastic volatility models, consider the weekly spot market index series £; 
of Bauwens and Lubrano (1998) relating to the shares of Belgian firms at the Brussels 
stock exchange for 3-1-86 to 26-1-96, namely 508 observations. The transformed 
outcome is the index return, given by the first difference of the logarithm of the index 
(times 1000), namely y; = 1000[ log (E;) — log (Z,-1)). 


We first consider a variant of the ARCH model, with 
Wah Hh Uy 
where u, ~ N(O, 1) and 
h =a0 oe? ,=09 ai- py (5.46) 


The coefficient a; is constrained to be between 0 and 1, and ap to be positive. Fit is 
assessed by the CPO estimate based on the average of the inverse likelihoods, and by the 
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predictive loss criterion of Gelfand and Ghosh (1998). The initial variance / is taken as 
a separate fixed effect. The transformed outcome used by BL leads to large variances, 
and to accommodate this a uniform prior U(0, 1000) for ao is adopted. 

A two chain run to 5000 iterations (and 1000 burn-in) shows the lag coefficient a; in 
Equation (5.46) to have 95% interval (0.28, 0.66), and ap one of (182, 261). The 
predictive loss criterion is 380 200. 

A second approach is provided by a GARCH(1, 1) model applied to the series 
y, = yı — 2.246, where y = 2.246. This avoids introducing a parameter for the mean of 
Yn and h; is then as in Equation (5.43). The posterior means for a; and y are, 
respectively, 0.26 (s.e. 0.06) and 0.63 (s.e. 0.07). The predictive loss falls to 372 500. It 
may be noted that initial iterations with this model are very slow. 

A third model (Model C) for these data is provided by the unobserved ARCH model 
of Equation (5.44), with a non-zero mean pu assumed for the à;, so that 


ye ~ NO, o°) 

^: ~ N(w, hi) 

e a pb 

hy = Q0 Oye? 
This also yields an improvement in fit over the ARCH model of Equation (5.46), with 
predictive error of 373 200. This is assessed from the second half of a two chain run over 
6000 iterations. Convergence in the parameters ag and a; is only obtained after about 
3000 iterations using the over-relaxation option. The lag coefficient a; has a median of 
0.33, but is not precisely identified, having 95% credible interval {0.02, 0.96}. The most 
distinct outlier (with lowest CPO) is associated with the sharp drop between weeks 90 
and 91 from 4516 to 3955. 


The final model (Model D) is provided by a random walk model for the stochastic log 
of the variance V;, with 


YVtSM & 
where 
e; ~ NO, V») 
Because BUGS is parameterised in terms of precisions, one may set P, = 1/V; and then 
take H, = log (P;) as an RW(1) process: 
H, ~ N(Hi, o4) 


It is worthwhile experimenting with alternative priors on the smoothing variance, since 
this parameter is crucial to the performance of the model (see Exercises). Taking the 
prior 


1/04 ~ G(1, 0.0001) 


leads to a predictive loss criterion of 365900. The plot of the precisions P, in 
Figure 5.12 shows the highest volatility (i.e lowest precision) at weeks 90-100 and 
225-235. 


Example 5.14 Exchange rate Durbin and Koopman (2000) and Harvey et al. (1994) 
consider stochastic volatility models of the form 
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Figure 5.12 Spot market index volatility 


Vi = Bx: Ur / hy 


us ~ N(O, 1) 
h = 0° exp (0;) 
0 = 0-1 n 


where the n, are normal with variance Cae Durbin and Koopman consider a series on 


the pound-dollar exchange rate between October Ist 1981 and June 28th 1985, and 
define a model with no predictor term Bx; or constant Bo. It is valid to assume that 
Bo = 0, since the observations consist of differences in logged exchange rates E,, with 
yr = Alog (E;). 

In BUGS, the time varying dispersions are parameterised via precisions 


P, = 1/h, 
and the first model follows Durbin and Koopman in assuming stationarity in 0, with 
0<o<l 


We find — from the second half of a two chain run of 5000 iterations — median estimates 
for o and o, of 0.50 and 0.16, whereas Durbin and Koopman cite values of 0.63 and 
0.17. They estimate a lag coefficient b of 0.973, whereas the value obtained here, using a 
U(0, 1) prior, has mean 0.982 and a 95% interval {0.959, 0.997} .The variances are below 
0.5 for most of the period but increase to over | in the spring of 1985 (see Figure 5.13), 
exceeding 2.5 for some days. 

An alternative model drops the stationarity assumption, since seems to approach 
1 and hence non-stationarity might be implied. A strategy allowing non-stationarity was 
proposed by Gamerman and Moreira (1999), though it is theoretically implausible in 
the long run. Despite setting a prior 6 ~ N(0, 1), the posterior 95% interval is still 
entirely below 1. The pseudo marginal likelihood is slightly lower (—911.6) than the 
stationary option (—909.6). 
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Figure 5.13 Volatility of exchange rate series 


5.7 MODELLING STRUCTURAL SHIFTS 


State space models are designed to accommodate gradual or smooth shifts in time series 
parameters. Often, however, there are temporary or permanent shifts in time series 
parameters that occur more abruptly, and a more appropriate model allows for changes 
in regression regimes and other shifts in structure. Section 5.2 considered innovation 
and additive outliers. Here we consider models that allow for repeated switching 
between distinct regimes according to a latent Markov series, and models for shifts in 
both the mean and variance of autoregressive series. A further category includes 
switching regression models (Maddala and Kim, 1996). 


5.7.1 Binary indicators for mean and variance shifts 


McCulloch and Tsay (1994) consider autoregressive models allowing for shifts in mean 
and/or variance. By allowing for variance shifts as well as changes in level, nonstation- 
ary trends that might otherwise have been attributed to changes in level may more 
appropriately be seen as due to heteroscedasticity. McCulloch and Tsay choose to focus 
explicitly on autoregressive models, rather than introduce moving average effects be- 
cause given a sufficiently large AR model, one may achieve similar results to a stipulated 
ARMA model. Thus let 


Vis Vy Et 
where a change in level is accommodated by letting 
Be = Mi uv (5.47) 


The 8), are binary variables for each time point which equal 1 if a shift in mean occurs 
and v; models the shift that occurs, conditional on 8); = 1. The v; are usually modelled 
as normal with mean zero and low precision T,. The autoregressive component of the 
series is the p-lag model for ¢;, namely 
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Et = V1Er-1  Y2&r-2 ++ Yr-pët-p Mt (5.48) 


where shifts in the variance of u, are allowed. Thus, let u; ~ N(0, V;), and let 8, be an 
additional binary series such that 


V, — Fii (82, = 0) 


(5.49) 
= Vio (82 = 1) 


where œ, models the proportional change in the variance. Alternatively 
V= Vi 82) 


The w; are positive variables with, for example, a gamma prior. The binary shift 
indicators may be taken to be Bernoulli variables, where the probabilities that 6, and 
62, equal 1 are known and small (e.g. nı = nz = 0.05). Alternatively, these probabilities 
may be assigned beta priors that favour low values. The relative importance of the 
mean and variance shift components (as reflected in the sizes of n; and n; if they are 
free parameters) will be affected by prior specifications on the w, and the variance of 
the v;. 


5.7.2 Markov mixtures 


A different approach to changes in regime involves the Markov mixture model of Chib 
(1996), Leroux and Puterman (1992), and others. Thus suppose for each time point the 
process is in one of m states {s;}(t > 1), as determined by an m x m stationary Markov 
chain P = {p,;}, where 


py = Pris; = jls- = į (5.50) 


The first state (namely sı) is determined by drawing from a multinomial with m 
categories. Given the underlying state s; = k, the observation itself has the kth of the 
m possible components of the mixture, and these components might differ in means, 
variances or other summary shape parameters. 

A model with both mean-variance shifts and reference to a latent Markov series is 
suggested by Albert and Chib (1993). Their model has m = 2 and values s; = 0 or s; = 1. 
Thus, an order p autoregression in the regression errors allows for shifts in mean and 
variance of y, within the specification 


Yilsı = Bx; ws, 10% 1 — Bxr1 — Ws, 1) Yor 2 — Bx1-2 — ws; 2) 


(5.51) 
Sat Yı p ~BXiH-p — VSi-p) Ur 
where u, ~ N(0, V;). The variance shifts are produced according to the model 
V, =0°(1 os) (5.52) 


where w is the proportionate shift in variance when s, = 1. This model involves possible 
correlations between B and the mean shift parameter , especially if x, just consists of an 
intercept. Improper or weakly informative priors may reduce identifiability. 


5.7.3 Switching regressions 


Switching regression models originate in classical statistics with Quandt (1958), and 
have received attention in Bayesian terms in works by Geweke and Terui (1993), 
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Lubrano (1995b), among others. In such models, the conditional expectation of the 
endogenous variable may follow two or more regimes. Suppose, for example, that the 
error variance does not also switch and that some regression effects (applying to 
exogenous variables z,) are not included in the switching. 

The choice between regimes is determined by a threshold function K, that drives 
either abrupt switching by a step function or a smooth transition function. The latter is 
typically a cumulative distribution function between 0 and 1, such as the logit (Bauwens 
et al., 1999). 

For instance, a step function A; might be defined as one if a trend in time exceeds a 
threshold 7t, and zero otherwise. If the trend were simply measured by the linear term t, 


then 
K,=t-T<05A,=0 
T f (5.53) 
K,=t-t>0s3A,=1 


The simplest model then allows for two regimes: 

Y= Yz (l= A)Bix ABX u (5.54) 
where, for example, u, ~ N(0, o°). The threshold function might also be defined by lags 
on the outcome, such as in the step function scheme 

Ki = yı -d<0>4=0 
K,=y,1-d>05A,=1 
More generally, the appropriate lag r in y,, such that 
r=1 iya >d 
is an additional unknown. Geweke and Terui (1993) consider joint prior specification 
for {r, d} in models where the alternative regimes are different order lags in y; such as an 
AR(?1) model if A; = 1, and an AR(p2) model (with different coefficients throughout) if 
A; —= 0. 
A smooth transition function in these cases might take the form 
A; = exp(9{yr-1 — d})/[1 exp (etyr-1 — d))] 
or 


A, = exp (ẹ{t- T1})/[1 exp(@{t — 7})] 


where @ > 0 governs the smoothness of the transition. 


Example 5.15 Fetal lamb movements An example of the Markov mixture model is 
provide by a time series of lamb fetal movement counts y, from Leroux and Puterman 
(1992), where the presence in the mixture of more than one component leads to Poisson 
overdispersion. One might model such over dispersion by a gamma mixture (leading to 
a marginal negative binomial model). 


Alternatively, suppose, following Leroux and Puterman, that a two class Markov 
mixture applies, with shifts between two Poisson means determined by a Markov chain 
(i.e. m = 2). Relatively diffuse Dirichlet priors for each row of P are adopted, such that 
for the elements in Equation (5.50) one has 


Pi, \:m ~ DA, l, pi 1) 
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The same prior is used for the first period state choice model. For the two Poisson 
means G(1, 1) priors are stipulated, with an identifiability constraint that one is larger — 
an initial run justified such a constraint, showing the means to be widely separated. 

With this model, a two chain run of 5000 iterations (1000 burn-in) shows a state 
occupied most of the periods (about 220 from 240), which has a low average fetal 
movement rate, and a minority state with a much higher rate, around 2.2-2.3. The 
majority state has a high retention rate (reflected in the transition parameter po. around 
0.96) while movement out of the minority state is much more frequent (Table 5.5). 

The actual number of movements is predicted closely, though Leroux and Puterman 
show that using m = 3 components leads to even more accurate prediction of actual 
counts. The model with m = 2 shows relatively small CPOs for the movements at times 
85 and 193 (counts of 7 and 4, respectively). 

For comparison, and since the outcome is a count, Model B consists of an INAR(1) 
model. The ‘innovation’ process is governed by Bernoulli switching between means à; 
and M (with M2 > A; to guarantee identifiability). Thus, 


Y, ~ Poi(p,) 

W; = T? Yi-1 Ao do(1 = ò) t> 1 

by =M, A — ô) 
with 6, ~ Bern(n) and n assigned a beta prior. This model also identifies a sub-popula- 
tion of periods with a much higher movement rate, around 4.5, than the main set of 


periods. It has a very similar pseudo-marginal likelihood to the two-state Markov 
switching model (—180 vs. —179). 


Example 5.16 US unemployment As an illustration of models allowing both mean 
and variance shifts, consider the US unemployment time series analysed by Rosenberg 


Table 5.5 Lamb movements, Markov mixture model parameters and predictions 


Mean St. devn. 2.5% Median 97.5% 
Pi, 1 0.66 0.15 0.35 0.67 0.93 
Pi,2 0.34 0.15 0.07 0.33 0.65 
P2, 1 0.04 0.03 0.01 0.03 0.12 
P2, 2 0.96 0.03 0.88 0.97 0.99 
Periods with s = 1 17.8 9.4 6.0 17.0 40.0 
Periods with s = 2 222.2 9.4 200.0 223.0 234.0 
Ni 2.28 0.74 1.22 2.15 4.01 
A2 0.23 0.05 0.14 0.23 0.32 
Number of movements, actual and predicted Actual Events Predicted Events 
0 182 180.2 
1 41 43.7 
2 12 9.9 
3 2 3.6 
4 2 1.5 
5 0 0.6 
6 0 0.2 
7 1 0.1 
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and Young (1995) with the original six month average percent rates U, transformed 
according to 


yı=100xln(1 U, 1/100)—100x In(1___U;,/100) 


Here the full data set consists of monthly data from 1954-1992 inclusive, providing 
78 six monthly averages. As in Section 5.7.1, assume the binary indicators 6), and 8, 
for shifts in the means p,, and in the variances of the x, series are Bernoulli with 
probabilities n; and n), respectively. These are taken as extra parameters, and the 
priors for n; and n, determine the level of shifting and the results of the analysis may 
depend on how they are set. The relative size of n; and n, is also affected by the 
variances assumed for the variance of the v; and the informativeness of the prior 
assumed for the w+. 

Following Rosenberg and Young, a relatively low prior chance of shifts in either 
mean or variance is assumed, with n; ~ BU, 19) and n ~ BU, 19). The autoregressive 
series is taken as order p = 1. Rosenberg and Young experimented with a lag 2 model, 
but found it to produce no improvement in fit. The proportional shifts, as in Equation 
(5.49), are taken to have a gamma prior favouring a concentration around an average of 
1, namely 


wr ~ GCS, 5) 


One might experiment with different prior assumptions on the degree of concentration 
around the average of 1, e.g. by alternatively taking G(1, 1) and G(10, 10) priors for the 
w,. As to the variance of the v,, this can be preset, and Rosenberg and Young suggest 
using a large multiple (e.g. 10 times) the residual variance from a standard ARMA 
model. Fitting ARMA(1, 1), ARMA(2, 2) and similar models showed a residual variance 
around 0.25, and so taking v, ~ N(0, 2.5), is one option. It might be sensible to assess 
alternatives involving other multiples, such as v; ~ N(O, 5) or v; ~ N(O, 1). One might 
also take the variance of the v; as an additional parameter. 

Here a two-chain run to 2500 iterations and with v; ~ N(0, 1) shows probabilities of 
0.014 for level shift and 0.124 for variance shift. Identifiability is improved by directly 
sampling the shift probabilities n; and n> from their full conditional densities. The lag 
1 parameter y; in the autoregressive series is estimated as 0.48. Rosenberg and Young, 
in an analysis of quarterly rather than six monthly series, also found a higher probabil- 
ity n, of a variance shift than a mean shift, but with the excess of n, over n; (0.086 vs 
0.015) smaller than that estimated under the model here. A very close fit to the observed 
differenced and transformed series is achieved, with no discrepant CPOs (the smallest, 
around 2% of the maximum CPO, is for observation 35 in the series of 77). 

For comparison, a relatively simple state space model in the undifferenced and 
untransformed six-monthly unemployment rate is also fitted (Model B in Program 
5.16). This involves a first order random walk model in the ‘true’ series. The sum of 
squared prediction errors of actual vs. predicted unemployment rates (E.s in Program 
5.14), shows this relatively simple model to have broadly comparable fit to the mean and 
variance shift model. 


Example 5.17 Consumption function for France To illustrate regression switching, we 
follow Bauwens, Lubrano and Richards (1999, p. 248) in considering a stochastic 
consumption function for France. This is applied to logged consumption and income 
for 116 quarterly points from 1963Q1 to 1991Q4. The outcome is the first difference 
Alog C; in logged consumption, and is related to the comparable income variable 
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Alog Y;, and to the lag 4 difference in log consumption. Conditioning on the first five 
observations gives for t = 6, 116 


AlogC,;=6 7D.1969Q2 yAlogC,-4 BAlog Y, 


(5.55) 
(p — D[log Cı — vlog Yı] ur 


where u; ~ N(0, o?). The term D.1969Q2 reflects a short-run distortion due to the wage 
increases following the Matignon negotiations of 1969. As to the error correction term 
in square brackets this reflects a long-term equilibrium between consumption and 
income 


log C* = logK vlog Y* 


Providing v, the long run elasticity, is 1, then C* = KY* and the propensity to consume 
K is constant. In a dynamic (first differences) framework, the long-term equilibrium is 
expressed as 


Alog C* = vA log Y* 


This implies that a shift in the propensity to consume (i.e. a change in K) only changes 
the constant in Equation (5.55), with all other regression effects unaffected by the 
switching. 

The observed propensity to consume K in this period in France is distorted not just 
by the 1969 negotiations, but by a longer run upward movement from around 0.77 in 
1978 to 0.89 ten years later. The question is whether, despite this apparent trend, 
an underlying constant propensity can be obtained by suitable parameterisation of 
the consumption function (5.55), and by switching of the regression constant 6 in this 
function. 

Bauwens et al. first fit Equation (5.55) without any switching mechanism, and find that 
no equilibrium is defined. They then allow a single permanent shift in the propensity to 
consume via a step function as in Equation (5.53). They find this leads to v under 1, a 
situation incompatible with a constant K. They then investigate whether a single, but non- 
permanent, shift in the mean propensity to consume restores a stable consumption 
function as defined by v = 1 (or by a 95% credible interval for v including 1). This implies 
a return to a previous equilibrium after the temporary transition from equilibrium. 

They therefore introduce a double parameter transition function such that 
I(t, 72) = 1 for quarters t between t; and t (both set within the period spanned by 
the observations) and J(1;, T2) = 0 otherwise, so that 


AlogC;=8 «l(t, 7) nD69.2 yAlogC,4 BAlog Y, 


(5.56) 
(p — Dilog Cı — vlog Yı] u 


The unit elasticity model with v = 1 is accepted with this transition function, and 
Bauwens et al. obtain a final equation (where v is set to 1) as follows (posterior means 
and SDs): 
A log C, = — 0.0088 — 0.0071I(T1, T2) 0.019D69.2 — 0.26A log C;-4 
[0.0041] [0.0017] [0.0067] [0.078] 
0.234 log Y, — 0.1 1[ log (C;-1/ Y;-1)] 
[0.069] [0.022] 
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with the mean of tı estimated as 1973.3 and of t2 as 1984.1. The prior ranges (within 
t = 6,.. 116) for these two threshold parameters t; and t2 are U(29, 61) and U(62, 98). 
These ranges are separated for identifiability and chosen by trial and error (Bauwens et 
al., 1999, p. 250). 

Estimates with this model may be sensitive to prior specifications on the single or 
double break points, tı and t2. Thus, uniform priors over the full range of times (6-116) 
may well give different estimates to priors restricted to an interior subinterval (e.g. 20— 
100). Similarly, a gamma prior such as G(0.6, 0.01) with average 60, approximately half 
way through the periods, but with large variance — and with sampling constrained to the 
range (6, 116)-might be used, combined with a constraint t2 > tı. This may lead to 
different results than a uniform prior. 

There are also possible identification and convergence problems entailed in the non- 
linear effects of p and v in Equation (5.56), when v is a free parameter. Here p is allowed 
to be outside the interval [—1, 1]. One way to deal with the identifiability problem is to 
introduce a conditional prior for v given p, or vice versa (see Bauwens et al., 1999, 
p.142), and so p is taken to be a linear function of v, namely a; av. Then N(0, 1) 
priors are adopted on a; and az, and all regression coefficients with the exception of v, 
which is assigned an N(1, 1) prior, are constrained to positive values. 

We first fit Equation (5.56) with a single break point (i.e. a permanent shift in the 
propensity to consume), v a free parameter, and a G(0.6, 0.01) prior on the breakpoint 
tı. Convergence in all parameters in a three chain run occurs after 15 000 iterations and 
from iterations 15000-20 000, a 95% credible interval for p of {1.02, 1.19} is obtained, 
and a pseudo-marginal likelihood of 390. The density for v is concentrated below unity, 
with 95% interval {0.86, 0.95}. The density for 7; is negatively skewed and has some 
minor modes; however, there is a major mode at around t = 85 to t= 90, with the 
posterior median at 87 (i.e. 1984.3). 

To fit Equation (5.56) with v still a free parameter, and two breakpoints (i.e. a 
temporary shift in the propensity to consume), the intervals (29, 61) and (62, 98) of 
Bauwens et al. are used in conjunction with gamma priors (Model C in Program 5.17). 
Taking wider intervals within which sampled values may lie, such as (7, 61) for 7, 
causes convergence problems. Even with the same intervals for the breakpoints as 
adopted by Bauwens et al., convergence on v is slow. The Gelman—Rubin scale reduc- 
tion factor on v remains at around 1.2 after iteration 8500 in a three chain run, and the 
95% interval from 5000 iterations thereafter is {0.88, 1.02}, including the equilibrium 
value of 1. On this basis the same model (5.56), but with v set to 1, may be fitted (see 
exercises). 

It may be noted that an alternative methodology uses a smooth rather than abrupt 
transition function, such as 


A,=1/[1  exp(— 9(t — 71)(¢— 72))] 


Other variations might include a stationarity constraint with |p| < 1. 
5.8 REVIEW 
Bayesian time series analysis offers flexibility in several areas, and is now a major theme 


in new time series developments. Among major modelling areas illustrating the benefits 
of a Bayesian approach may be mentioned: 
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e the lesser restriction to stationarity in autoregressive time series as compared to 
classical ARIMA analysis (e.g. Nandram et al., 1997); 

e the representation of nonlinear time series and time-varying parameters through 
dynamic linear models (West and Harrison, 1997); a recent survey of state-space 
time series modelling (including non-Gaussian cases) is provided by Tanizaki and 
Mariano (1998); 

e the reduction of information in a large set of variables by using a factor structure in 
stochastic volatility applications (Aguilar and West, 2000; Meyer and Yu, 2000); 
time-varying volatility is also tackled by ARCH models with recent Bayesian studies 
including Vrontos et al. (2000); 

è the ability to model structural changes in different aspects of the time series such as 
level trend and variance (see Section 5.7 and Wang and Zivot, 2000); 

e recent developments in dynamic linear and semiparametric modelling of discrete 
outcomes (Cargnoni et al., 1997; Fahrmeir and Lang, 2001). 
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EXERCISES 


1. In Example 5.1, assess the evidence for a non-zero p by applying the binary coeffi- 
cient selection method to the outlier innovation model (Model C in Program 5.1). Is 
this inference affected by taking a smaller value of A (e.g. 0.01)? 


2. In Example 5.2, try the same modelling approach for the velocity series (1869-1988). 
This series is identified as nonstationary (with significant probability that p > 1) by 
Bauwens et al. (1999). The data for this series are included in Example 5.2. 


3. In Example 5.2, assess the inferences regarding stationarity when the innovations are 
assumed Normal rather than Student t. 


4. In Example 5.10, fit a true series evolving according to an RW(3) model such that 


fO=¥E- 1) -3ft-2 FfE-3) uD 
and assess its fit against the RW(2) model. 


5. In Example 5.13, try the ARCH Model A but introduce a lag in y,_; into the model 
for the y series. Does this improve the predictive loss criterion? 


6. In Example 5.13, try alternative priors on the random walk variance in Model D 
(such as a uniform prior on V}° or alternative gamma priors on 1/V;), and assess 
effects on model fit. 


7. In Example 5.15, fit a Markov switching model with a three state transition matrix, 
and similarly the INAR(1) variant model with an innovation process mixing three 
different means. How do these compare in terms of the DIC criterion and pseudo- 
marginal likelihood? 


8. Fit the French consumption model of Example 5.17, as in Equation (5.55), with two 
breakpoints and v set to 1. Compare inferences obtained with gamma priors on 71 
and 7 constrained to the intervals (29, 61) and (62, 98) with those obtained using 
uniform priors on the same intervals. How does the fit obtained compare to a model 
with one breakpoint only? 
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CHAPTER 6 


Analysis of Panel Data 


6.1 INTRODUCTION 


Panel or longitudinal data sets occur when the continuous or discrete response Y; of 
each subject i(i = 1,.., N) is observed on several occasions t = 1,.. 7;. Occasion totals 
T; may differ between subjects, as may times of observations v;,, and so spacings 
Ait = Vit — Vi, t-1 between observations. Such a panel or serial data set may pertain to 
individual patients in clinical trials or subjects of follow-up surveys, or to aggregated 
units such as population age groups or geographical areas. The analysis of change in 
serial measurements over individuals or groups plays a major role in social and bio- 
medical research, and is fundamental in understanding causal mechanisms of disease or 
social pathology, in assessing the impact of policy or treatment interventions, and in the 
analysis of developmental and growth processes. For instance, in economic applica- 
tions, the panel may be at individual, household or firm level and relate to questions 
such as economic participation or consumption (for households or individuals) or 
patent activity and investment levels at firm level. 

Major methodological questions in panel data include the modelling, via random 
effects, of permanent subject effects, of growth curve parameters, and autocorrelated 
errors. Also important are the extension to categorical outcomes (binary, multinomial 
or count data) of methods originally developed for continuous outcomes (Chib 
and Carlin, 1999). Missing data often occur in panel studies (see Section 6.5), especially 
permanent loss or ‘attrition’ of subjects, where 7; is less than the maximum span of 
the study in clinical trials or panel studies of economic interventions (Hausman and 
Wiseman, 1979); Bayesian perspectives on this issue include those of Little and Rubin 
(1987). 

The accumulation of information over both times and subjects increases the power of 
statistical methods to identify effects (e.g. treatment effects in medical applications), and 
permits the estimation of parameters (e.g. permanent effects or ‘frailties’ for subjects i) 
that are not identifiable from cross-sectional analysis or from repeated cross-sections 
on different subjects. While cross-section data can be used to estimate age or cohort 
related change, these estimates rely on differences between groups rather than individual 
change profiles (Ware, 1985). Diggle et al. (1994) provide an example of a cross- 
sectional relation between reading ability Y; and age X; 


Yi=a BX; & 
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where B is the average difference in ability for groups differing by a single year of age. 
With T repeated yearly observations on N individuals one may identify a more substan- 
tively informative model 


Yn =a BX, (Xr Xa) Er (6.1) 


where $ is the same as in Equation (6.1), and ò is the growth in ability for an extra year 
of age. Growth curve models generalise Equation (6.1) by using random effects to 
model variability in baseline ability or growth over subjects. 

In economic applications, a cross-sectional model into, say, patent activity by firms is 
limited to considering the impact of observed predictors on the outcome. A longitudinal 
model gives better scope to assess the role of more or less constant unobserved hetero- 
geneity between firms, in terms of entrepreneurial and technical skills which affect 
patent applications, and may be difficult to capture with observable variables (Winkel- 
mann, 2000). 

Longitudinal designs provide information to describe patterns of change and devel- 
opment, enabling predictions of individual growth or change beyond the observed path 
that take account of not only the impact of age or time, but of additional subject 
variables. For example, Lee and Hwang (2000) consider the best choice of prior for 
the purposes of extended prediction, namely prediction beyond the observed time range 
of the sample. Predictive applications also occur in demographic and actuarial contexts, 
where observations of mortality or other vital events are recorded for several periods in 
succession, and classified by age, sex or other demographic characteristics (Hickman 
and Miller, 1981). Here one may be interested in forecasts of vital events in these 
categories in future years. 


6.1.1 Two stage models 


The modelling of subject effects via univariate or multivariate random effects leads into 
a wide class of two stage models for both growth data and other types of longitudinal 
observations. Thus, in a simple linear growth curve 


Yr =a; Sit Er (6.2) 


the random effects describe differences in baseline levels of the outcome (a;), such as the 
underlying average attainment for subject i, or in the linear growth rates of subjects (8,), 
such as differences in attainment growth. In more general models of this type, also 
including covariate effects varying over subjects, the distribution of the random effects, 
whether parametric or non-parametric, and if parametric, whether Normal or otherwise 
(Butler and Louis, 1992), constitutes the first stage of the prior density specification. 
The hyperparameters on the density of 0; = {a;, 5;...} form the second stage of the 
prior specification. 

In many studies, the interest may especially be in identifying subject level effects 
from panel data with greater reliability than is possible with cross-sectional data 
(Horrace and Schmidt, 2000), with a typical specification for continuous data taking 
the form 


Yr =i BXy Ei (6.3) 


with a and £ both Normal, and with the a; modelled as random effects independent of 
other information. Heterogeneity between subjects i in their levels on a; may be 
interpreted as unobserved differences that impact on the outcome, and reflect stable 
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unmeasured characteristics of individuals. Control for unobserved heterogeneity is then 
the basis for obtaining consistent estimates of the systematic (regression) part of the 
model, involving observable predictors (Hamerle and Ronning, 1995). An alternative 
‘factor analytic’ perspective on permanent effects is provided by Dagne (1999), in which 
the a; may themselves be related to covariates; one might have loadings à on the a; 
varying either by time 


Yr =; BXr Ex 
or by subject 
Ya = Nia; BX Ext 
with the variance of the a; pre-defined (e.g. var (a) = 1) for identifiability. However, one 
does not necessarily need to adopt this approach to relate the a; to fixed covariates W;. 
Certain prior specifications on the a; in the conventional random effects model (6.3) 


may improve identifiability if the s; are autocorrelated. For continuous outcomes, 
suppose the errors follow a first order autoregression 


Yr =O; BX eit 
Cn = Yei 1 Mir ot > 1 
uit ~ N(O, © 2) 
en ~ N(O, 07) 
Then, following Chamberlain and Hirano (1999), rather than the prior 
a; ~ N(0, 02) 
for the permanent effects, one might link the initial conditions and the a; via the prior 
a; ~ Nen, o?) 


where can be positive or negative. This amounts to assuming a bivariate density for a; 
and e; with independence between them corresponding to y being effectively zero. Note 
that if y is close to 1, it may be more difficult to identify a; and e; separately. 

Enduring differences between individuals represented by fixed subject level errors 
may be associated with a form of structural or spurious state dependence in panel data 
(Heckman, 1981). This is particularly so in panel studies of binomial events (e.g. 
unemployment, accidents, labour participation) or of multinomial choices (e.g. choices 
between product brands). The individual effect «; here may be interpreted as a propen- 
sity to experience the event, or as the utility of a certain choice, and such variation of 
itself induces correlation over time. Thus, if subject level effects have variance o? and 
the (uncorrelated) observation errors have variance o2, and denoting the combined 
error 


Nit = Eit Qi (6.4a) 
the correlation between n; at periods s and f¢ is 


T= o2/(o2 o?) (6.4b) 


€ 


Unmeasured differences between individuals may also operate through an autocorre- 
lated structure. Suppose that the observations Y; were binary such that on a continuous 
latent scale Z;, > 0 if and only if Yy = 1, and Yj, = 0 otherwise. Then 
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Zit = BXi Qit 


Qit = YO, t-1 Eit 


(6.5) 


so that a subject who has a high propensity a; at time ¢ will have a higher propensity 
than average att 1 if y is positive. 


Autoregression in the observations By contrast, true state dependence in a model for a 
binary panel outcome would mean model (6.5) would be extended to include autore- 
gression on Yy itself, for instance as follows: 


Zi = BXit pYi, t-1 Qi 


Qit = Yi, t-1 Eit 


(6.6) 


Here pY;,;-1 may be seen as measuring the association between the event in the 
preceding period and the utility or propensity in the current period. With appropriate 
parameterisation, a binary model involving a lag in observed outcome Yj, ;_1 may be 
cast as a Markov chain model (Hamerle and Ronning, 1995). In practice, models such 
as Equation (6.6) may be difficult to identify for binary outcomes, but may be feasible 
for count or continuous outcomes. 

Autoregressive panel models for continuous outcomes may involve several lags, for 
example in an AR(2) model 


Yi = BXi Pi Y;, -1 Po Y;, t—2 €i 


and may be extended to included moving average terms in the ¢;;. Although autoregressive 
(or ARMA) models and growth curve models are often seen as competing alternatives, 
there is scope for combining them. Curran and Bollen (2001) consider models such as 


Yi =O; Of PY 1 Ei 


that they term Autoregressive Latent Trait (ALT) models. 


6.1.2 Fixed vs. random effects 


While random effects models of subject level variability are increasingly used as stand- 
ard, there are possible caveats against random effects models in observational (non- 
experimental) panel studies. Suppose a model for count outcomes Yp, with means p; 
has a log link regression of form 


logum =O; BX ei 


Such a random effects model may assume subject effects a; to be independent of 
observed characteristics, so that individuals with different levels of W; or Xi; have the 
same expected value of a;. This assumption may be realised under randomisation (e.g. in 
medical trials) but may be less likely in observational settings, where selectivity effects 
operate. Allison (1994) cites the relation between depressive reaction to abortion Y and 
religion W,;: Catholic women are much less likely to have abortions, and so unless 
religion is included as a predictor, the effect of the event abortion (Xz; = 1 or 0) on the 
depressive reaction outcome will be confounded with differences in religion. Therefore, 
fixed effects models may be less restrictive in terms of their underlying assumptions. As 
well as the benefit of not assuming the independence of a; and {W;, Xu}, there is the 
improvement in robustness in not needing to specify the density of the a;. On the other 
hand, estimation and identifiability are problematic for large N and small T. 
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In a Bayesian approach, there may be less of a problem when a; is correlated with W;, 
since that correlation may be modelled in the random effects distribution, by linking the 
permanent effects to known covariates in a regression format — as is commonly done in 
multi-level growth curve models, but not confined to that setting. Thus, instead of 
regarding individual effects as nuisance factors, the latent frailty, severity, or attitude 
may be modelled as functions of covariates. Dagne (1999) contrasts random effects 
models with a slope of unity on the terms a;, to latent variable models which include 
both regression of the a; on covariates and allow a varying slope over subjects on the 
latent variable. 


6.1.3 Time dependent effects 


As well as using panel data to sharpen inferences about individual differences, in some 
circumstances a two stage model with random effects over time (rather than over 
individuals) at the first stage may be relevant. As compared to time series data, fewer 
points are needed to model evolving regression coefficients. Thus let counts of a health 
event Y;, be assumed Poisson 


Yir ~ Poi(Ejrbjz) 


where Ep are expected events using demographic standardisation. One may 
be interested in the changing impacts of covariates, fixed W;, or varying, Xj, as 
follows: 


log (wiz) = Bor Bi -Xit 


where By, and B,, are random over time. For example, in a study of changing levels of 
suicide mortality over the N = 33 boroughs of London, Congdon (2001) discusses 
models for the changing impact of socio-economic variables (area deprivation and 
community stability) on the health outcome. A further modelling choice is thus between 
a focus on variation in change between subjects or on temporal change in the impact of 
fixed or fluctuating covariates defined for subjects. 

One may also consider evolution in other parameters, such as lag coefficients on 
previous values of the outcome. In a model for continuous Y; with lag on Y; ;1, one 
might specify, following Curran and Bollen, 2001), an ALT model with time varying 
AR(1) parameter: 


Yr =a; dit P: Y;, t-1 Eit 


6.2 NORMAL LINEAR PANEL MODELS AND GROWTH 
CURVES FOR METRIC OUTCOMES 


This section considers in more detail the specification of the linear regression model for 
panel data on continuous outcomes, which sets the basis for panel models for discrete 
outcomes. Suppose the observations Y; = (Yn, Yn,.... Yir) on subjects i = 1,.., N are 
of equal length T, and their means p; depend upon a vector (of length p) of fixed 
covariates X;. Then 


Yr = Xi Eü (6.7) 
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where s; |:7 is multivariate normal with mean zero and T x T dispersion matrix >: 
Extension to time varying covariates Xj is straightforward. Time varying predictors in 
growth curve models might well functions of the times t = 1,.. T. 

Assume ¥ = =~! has a Wishart prior density with scale matrix R and degrees of 
freedom r, and B has a multivariate normal prior with mean By and dispersion matrix 
Bo. Then the full conditional distribution of B given >! is multivariate normal 


N,(8*, B*) (6.8a) 
where 
B* = B*(By'By JO Xabi) (6.8b) 
i=l, N 
and 
(BY =By' SO Xap, (6.8c) 
i=l, N 


The full conditional of y is Wishart withr N degrees of freedom and scale matrix R*, 
where 


(RT =R' J gig; (6.8d) 


i=l, N 


In a growth curve analysis, the design matrix X; would typically be time specific, but 
with equal values over subjects i. It might consist of an intercept Xj = 1 for all subjects 
and times, with succeeding covariates being powers or other functions (e.g. orthogonal 
polynomials) of time or age t. Thus for a linear growth model Xj = t, while a quadratic 
growth model would involve a further column in X, namely X; = £. If common 
coefficients B1, B2, B3, etc. are assumed across subjects, they represent the relationship 
between the mean outcome and time or age t. For example, studies of mean marital 
quality over time suggest a more or less homogenous linear decline over the course of 
marriage (Karney and Bradbury, 1995). 


6.2.1 Growth curve variability 


However, average growth curves will often conceal substantial variability in develop- 
ment that longitudinal research is designed to address. Such variability in growth (e.g. in 
the linear growth effects of Xin = t) may be correlated with variability in the individual 
levels on the outcome, leading to growth curves with multivariate random effects. For 
instance, a commonly observed effect in panel and growth curve models is regression to 
the mean, whereby higher growth occurs from lower base levels (so that growth and 
level are inversely related). An alternative to the notation {a;, 6;} in Equation (6.2) is 
the multivariate one b,x, k = 1, 2. For a linear growth curve with random intercepts and 
growth rates, a frequently used formulation is 


Y= bia bnt Eu (6.9a) 
with 
£; ır ~ Nr(0, $) 
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and the random effects b;, 1:2 following their own density, such as 
bi, bY Na( Hp, Xs) 


The mean values of the b; in this case would be the intercept wp; and average linear 
growth rate pp. 

Given the role of b; in representing individual variations, including correlations 
between the growth paths and the levels of each subject, it may become more reasonable 
after introducing the b; to assume that the s are independent. Hence, a simplifying 
assumption 


u ~ N(0, oD) (6.9b) 


may be made and assessed against other schemes involving a general unstructured 
dispersion matrix $} or some specified time dependence (e.g. AR(1) dependence). 
Other questions of interest might include establishing whether variations in linear 
growth rates bn could be explained by other fixed attributes W; of individuals: for 
example, whether differential declines in marital quality are related to initial spouse age, 
or to spouse education. 

If individuals i have different observation times, or are nested hierarchically within 
groups j, then more complex growth curve models have been suggested. Diggle (1988) 
proposes a model for panel data in which observation times vp may vary between 
subjects. Then the series for individual 1 may be modelled as 


Yii) = (vir) Wii) Er a (6.10) 


This representation contains a simple measurement error or white noise term €;, as well 
as autoregressive errors W;(v). The prior for the latter would incorporate a model for 
correlation p(A) between successive observations according to the time difference 
A; = v; 1 — v; between readings. The error association typically decreases in A, since 
measurements closer in time tend to be more strongly associated. The model includes 
constant subject level errors a; which may depend upon covariates. These stable effects 
may also pre-multiply covariates, including the times v; themselves, in which case they 
become variable growth rates. 

Suppose individuals i are classified by group j=1,..J, as well as by indi- 
viduals i within groups. Assume for simplicity equally spaced observation times for 
all subjects. Then the corresponding model to Equation (6.10) contains measurement 
error, as well as autoregressive dependence, at observation level, constant effects aj 
specific to subject i and group j, and growth curve parameters varying over group or 
over individuals. For example, a group varying linear growth model might take the 
form 


Yijt = A; B;t Qij  @Cijt  Eijt (6.1 1) 
where the e; are autoregressive with 
Cijt = Y1@ij, 1-1 Mit 


and both the g; and uj; are exchangeable measurement errors. The group effects (varying 
intercepts and linear growth rates) might be taken to have bivariate dependence. 
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6.2.2 The linear mixed model 


The growth curve with random variation in trajectory parameters over subjects, or 
subjects within groups, is a special case of the linear normal mixed model for subjects 
i=1,..N and times t=1,..T 


Yr = Xip Wibi Eu (6.12) 


In this specification, the g; are commonly taken to be distributed according to the 
homoscedastic assumption (6.9b), W; is usually but not necessarily a g-dimensional 
subset of regression vector X;, and b; a qx 1 random effect with mean zero and 
covariance matrix =». If the density of b; is multivariate normal, then the mean 
and variance of Y; unconditionally on b;, are respectively 


EY) = XiB 
and 
V(¥) = 07 Tr WoW; 
There may be additional grouping variables in 6.12, e.g. exam results over time for 
pupils 7 within schools j, or clinical measures over time for patients i within hospitals j. 
In this case, the multi-level random effects b;; may be assumed, or b; may be related to 
fixed characteristics of the higher level grouping. 

Suppose a Wishart prior is assumed on >, with degrees of freedom r, and scale 
matrix R,, and further that t = o~? is Gamma with parameters vı and v and that B has 
a multivariate normal prior with mean By and dispersion matrix Bo. Then the full 
conditional densities of B, >, and t can be obtained (see Chib, 1995) by rewriting 
Equation (6.12) as in Equation (6.7): 

Yi — Wibi = XiB €n 
The conditional density of B then has the same form as Equations (6.8a)—(6.8c) with 
dispersion matrix $ = o°Ir. The conditional density of t is Gamma with parameters 


V1 NT /2 


and 


vn 055) Sos 


i=l,N t=1,T 
while that of Sa: is Wishart with degrees of freedom r} N, and scale matrix R,* where 
(R$)! = R7! bib. 
i=l, N 


analogous to Equation (6.8d). Similarly, the full conditional of the b; conditions on ß, 
and (6.12) can be rewritten as 


Hy = Yu — XiB = Wibi eit 
The b; then have variances V;, given by 
V=; WW; 
and means 
Vit Wi; 
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With suitable adaptations, the linear mixed model may be applied with discrete out- 
comes. An alternative parameterisation in models such as (6.12) is considered by Chib et 
al. (1998) in a discussion of Poisson outcomes. When W; is a subset of X;, e.g. Wi = Xix 
in the case q = 1, it may be preferable to merge the ‘fixed effect’ term Xi, with the 
random effect term X;.b;, such that b; has a non zero mean, for example b; ~ N(up, 03), 
where u, = By. There is then no overlap between the fixed effect predictors X and the 
random effects predictors W. This may provide improved identifiability in Markov 
Chain Monte Carlo sampling 


6.2.3 Variable autoregressive parameters 


While random coefficients attached to powers of time may represent diversity in growth 
curves, another option is to incorporate elements of an autoregressive approach. One 
possibility is the autoregressive latent trait model mentioned above. Another is to allow 
AR coefficients varying over subjects. Thus, for homogenous impacts of previous values 
in a trajectory on current values, and with exogenous attributes X; (e.g. treatments, 
gender, etc.) as additional influences, a growth model in p lags might take the form 


Yi = Po PiYi, -1  P2Yi -2 =... BXi Eat (6.13) 


where po; represent different levels of the outcome for individuals. To allow for differing 
shapes in the growth trajectories of subjects, one or more of the lag coefficients 
P1» P2, : - may be allowed to be random as well as the intercepts poọ;. Then {o;, Pij- Ppit 
are taken to be multivariate Normal or Student ¢ with means {po, P1,- - Pp}. This may 
allow greater flexibility than differences in linear or polynomial growth rates, in time t 
or log time. One may also introduce autoregressive variability in the model for the error 
structure. For example, an AR(1) error structure in the errors of the means of a Poisson 
outcome might be taken to vary over subjects: 


log uy = BX Qi eit 
Cit = ViCit-1 Eit 


In a study of suicide trends, Congdon (2001) compares this ‘differential persistence’ 
model with a model allowing variable growth rates. 


Example 6.1 Growth curve analysis: plasma citrate readings An example of growth 
curve variability with continuous outcomes as in Equation (6.8a), and possibly auto- 
correlated errors, is considered by Hand and Crowder (1996). The measurements at 
t = 1,..5 equally spaced times and on i = 1,..10 subjects are of plasma citrate concen- 
trations in micromoles per litre. Hand and Crowder assume a two stage growth model 
with varying intercepts and growth rates, and also allow for AR(1) errors in the 
observation errors e;r, so that 


Yr =bn bnt eit (6.14a) 
for t > 1, with 
Cit = Vi, t-1 Uit (6.14b) 


where the ux are exchangeable with mean 0 and variance o. Instead of the direct form 
(6.14), one may accommodate this form of structured errors by transforming the 
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regression term for times ¢>1. This entails subtracting y, Yj, 1 = yıb 
yibn(t— 1) Yei, mı from Equation (6.14a), to give 


Ya =Y Yi 1 (l-y)ba balt-—yit Yı) uit 


For the first observation (t = 1), one option is to take e; = y,e uj as a distinct error 
term, with its own variance of. 

Here the intercepts b; and growth rates bn are initially assumed to be drawn from a 
bivariate Normal density, before considering whether extensions such as a mixture of 
Normals for the 6, are supported by the data. To assess fit and aid model choice we 
consider, as earlier, a predictive loss criterion, and a pseudo marginal likelihood 
measure’. In subsequent examples the DIC is also used. The mean intercept and slope 
bp; and pp, are assigned N (0, 108) and N(0, 10°) priors respectively. The correlation y; 
is taken to be N(0, 1), and so is not constrained to stationarity. For the precision matrix 
A of the 6, a Wishart prior with an identity scale matrix and 2 d.f. is assumed. 

For the standard multivariate Normal, three chains are run for 20 000 iterations with 
dispersed starting values’; convergence was obtained by around 1000 iterations. The 
resulting estimates (from iterations 1000-20000) show intercepts to be negatively 
related to slopes, as can be seen from the correlation Cj2 in Table 6.1. The clearest 
reductions over time in subject concentrations (i.e. the most negative b) are in those 
subjects (4, 9, 10) with initially high readings, illustrating a regression to the mean. 
However, the average coefficient for linear growth pp straddles zero, and there is only a 
65% probability that mp is negative (see ‘test? in Program 6.1). There is a weakly 
positive auto-correlation in the residuals ex; this might suggest that a simple uncorrel- 
ated error structure be adopted instead, but there is a clear worsening of fit when this is 
actually applied. 

The second model involves a discrete mixture of two sub- groups of patients, since 
some subjects have fairly clear evidence of decline, while others have static or even 
slightly rising readings. An identifiability constraint is applied on the intercepts of the 
two groups. Convergence with this model is much slower (only by 25 000 iterations in a 
three chain run of 50000 iterations). This analysis shows a low intercept group with 
positive (but insignificant) trend in time, and a high intercept group with trend coeffi- 
cient biased to negative values (95% interval from —8.1 to 2.1). This model has a similar 
pseudo-marginal likelihood to the one group multivariate Normal, but a clearly worse 
predictive loss: the precision of samples of new data is worsened. 


Example 6.2 Hypertension trial As an example of a hierarchical data set, Brown and 
Prescott (1999) present data on 7 = 1,..288 patients randomised to receive one of three 


' For the predictive loss criterion (see Gelfand and Ghosh (1998), Sahu et al. (1997) and Ibrahim et al. (2001), 
‘new data’ Z; are sampled from f (Z|0), where fis the same density assumed for Y; and 0 are samples from the 
posterior density p(6| Y). Let ¢;, and ọ; be the posterior mean and variance of the Zy. Then for w positive, one 


possible criterion has the form 
w 
DEt (H) 
7 w 1 


i 


This criterion would be compared between models at selected values of w, typical values being w = 1, w = 10 
and w = 10000. If the Y; are missing or censored, then the second term in the criterion is omitted and only the 
precision of prediction is relevant. Another useful diagnostic tool is the Monte Carlo estimate of the condi- 
tional predictive ordinate, obtained as the harmonic mean of the likelihood for the {i,t}th observation 
(Gelfand and Dey, 1994). The product of the CPOs yields a pseudo-marginal likelihood. 

? Null parameter values and the upper and lower 2.5th points of a trial run. 
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Table 6.1 Plasma citrate, AR(1) model with variable patient effects 


Mean St.devn. 2.50% Median 97.50% 


Overall intercept and growth rate 


Mp1 118.8 8.3 102.3 118.7 135.5 
Mp2 —0.56 1.74 —4.07 —0.54 2.94 


Dispersion matrix for subject effects 


bit 698.6 408.3 264.6 592.8 1776.0 
p12 —115.5 78.5 —320.1 —95.3 —30.8 
p22 20.9 16.7 3.6 16.6 64.1 


Correlation between intercepts and slopes 


Cio —0.97 0.06 —1.00 —0.99 —0.85 
Intercepts 

Subject 1 88.2 1.9 84.5 88.2 91.9 
Subject 2 115.9 1.4 113.1 116.0 118.5 
Subject 3 127.2 1.5 124.4 127.2 130.1 
Subject 4 149.4 1.8 145.7 149.4 153.0 
Subject 5 102.7 1.5 99.6 102.7 105.6 
Subject 6 108.3 1.6 105.3 108.3 111.6 
Subject 7 83.9 2.0 80.0 84.0 87.8 
Subject 8 116.1 1.4 113.4 116.1 118.7 
Subject 9 157.7 2.1 153.5 157.8 161.8 
Subject 10 141.6 1.7 138.2 141.6 144.9 
Growth Rates 

Subject 1 4.79 1.88 1.15 4.78 8.53 
Subject 2 0.05 1.36 —2.56 0.02 2.87 
Subject 3 —2.22 1.45 —5.14 —2.20 0.57 
Subject 4 —5.40 1.84 —8.98 —5.42 —1.65 
Subject 5 2.28 1.53 —0.64 2.25 5.42 
Subject 6 0.67 1.58 —2.64 0.72 3.64 
Subject 7 5.06 1.96 1.31 5.03 8.92 
Subject 8 —0.07 1.35 —2.68 —0.09 2.63 
Subject 9 —6.74 2.08 —10.79 —6.76 —2.53 
Subject 10 —4.57 1.69 —7.92 —4.56 —1.23 


Correlation of residuals 


yı 0.16 0.19 —0.22 0.16 0.55 


Residual variance 


o? 219.8 54.5 136.2 212.1 348.8 


drug treatments for hypertension (A=Carvedilol, B=Nifedipine, C=Atenolol), with 
drug A being the new drug and the other two being existing standard treatments. 
Patients are allocated to one of j = 1,..29 clinics. The analysis here considers a pre- 
treatment (week 1) baseline reading B; of Diastolic Blood Pressure (DBP), and four 
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post-treatment readings Y; at two weekly intervals (weeks 3, 5, 7 and 9 after treatment). 
Treatment success is judged in terms of reducing blood pressure. 


A first analysis of these data (Model A) is a fixed effects model without random effects 
over patients or patients x clinics. It involves just baseline B and treatment n effects, with 
the new treatment A as reference category 

Yr=p BB Ne Nc eit 

Because a single baseline measure is missing, an imputation is made based on random 
missingness (see Section 6.5) such that the response mechanism itself does not need to 
be modelled. Convergence with three chains was achieved at under 500 iterations, and 
the summary is based on iterations 500-5000. The results of this analysis (Table 6.2) 
show the lowest DBP readings for drug C, once baseline morbidity is controlled for; 
centred treatment effects may be obtained by subtracting na( = 0), Nng and nc from 
their mean. A low CPO is apparent for the third visit of patient 249, with DBP reading 
of 140, compared to previous readings of 120 and 118. The predictive loss criterion, 
the pseudo-marginal likelihood and DIC? are, respectively, 126 353 (for w = 1), —3925 
and 9569. 

Introducing a subject level random intercept leads to the model (Model B in Program 
6.2A) 


Yı=u 5 BB NB Nc Fit 


Define n; = & b; as in Equation (6.4a). Then the variance of the constant subject 
effects is determined by the correlation t as in Equation (6.4b), which is assigned a 


Table 6.2 Hypertension trial, alternative models (without 
clinic effects) 


No random effects Mean St. devn. 2.5% 97.5% 


Nifedipine (ng) 108 0.65 -2.52 0.05 
Atenolol (nc) —2.95 0.64 —4.22  —1.70 
Baseline (B) 0.51 0.06 0.39 0.62 
m 40.4 5.8 29.1 52.7 


Normal subject random effect 


Var(b) 39.9 4.4 32.0 48.9 
Nifedipine (np) —1.34 1.02 -3.31 0.68 
Atenolol (ne) —3.16 1.01 —§.22 1.19 
Baseline (B) 0.49 0.09 0.32 0.66 
m 42.2 9.1 24.6 60.3 

1 0.51 0.03 0.44 0.57 


Dirichlet process 


Nifedipine (ng) —1.03 1.01 —2.97 0.98 
Atenolol (ne) -9.73 1.04 -4.72  —0.66 
Baseline (B) 0.44 0.09 0.27 0.61 
m 46.6 9.4 28.0 65.1 


3 This DIC is based on minus twice the likelihood. 
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uniform prior over (0, 1). Estimates for this model show slightly enhanced treatment 
effects in absolute terms, but also reduced precision, so that a more substantial portion 
of the density of ng is above zero. In particular, its effect is indistinguishable from that 
of drug A. The model assessment criteria all agree on the gain in introducing permanent 
patient random effects; their distribution is shown in Figure 6.1. The predictive loss 
criterion falls to 67 400 (with w=1), the log pseudo-marginal likelihood improves from 
—3925 to —3671, and the DIC falls to 9030. 

Following Butler and Louis (1992), one possible model extension (Model C in 
Program 6.2) is a non-parametric mixture of sub-populations on the effects b;. For 
example, one option they suggest is sub-populations having different means and vari- 
ances, but with the sub-population means summing to zero; they also apply the non- 
parametric maximum likelihood model of Laird (1978). Parametric assumptions about 
the permanent subject effects are also avoided by a Dirichlet process mixture (Escobar 
and West, 1998). So it is assumed that 


bi ~ N(vi, b;) 

where 
(i, 6) ~ G 
G ~ D(Go, a) 


In practice, there will be clustering of the v; and ọ; values, and a maximum of J = 20 
clusters is assumed for the 288 patients The baseline prior Go has the form 


v; ~ NO, fid;) 


6.15 
$; ~ G(1, 0.001); j= 1,..,J oP) 


60 


0 
14.0 10.0 6.0 2.0 2.0 6.0 10.0 14.0 18.0 
-12.0 -8.0 —4.0 0.0 4.0 8.0 12.0 16.0 20.0 


Figure 6.1 Patient random effects (Model B) 
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where the f; determine the relative spread of cluster means and patient specific random 
effects, and are assigned a G(1, 1) prior. The category C; to which patient 7 is assigned 
has prior 


C; ~ Categorical(p,, p2,..- pJ) 


Only J* clusters (between 1 and J) are selected at any iteration as appropriate for one or 
more observations. The mixture weights pi, p2,...p; are determined by the stick- 
breaking method (see Sethuraman 1994, and Sethuraman and Tiwari, 1981). Thus, let 
ri, 12,--ry_1 be a sequence of Beta (1, a) random variables (and ry = 1) , and set 


P=" 
p2 = 121 — rı) 
p3 = 1731 — r2)(1 — rı) 


The precision parameter a is preset here to 1, though it may be updated using the 
algorithm described by Escobar and West, 1998, p. 9). Sensitivity to alternative values 
of a may be assessed by setting alternative values; larger values of a imply more clusters 
J and greater differentiation between the 5;. 

A two chain run shows convergence after 2500 iterations, and the summary is based 
on iterations 2500-5000. The posterior averages of J* is 5.9, and Figure 6.2 shows most 
of the density for J* is contained under 10 clusters. The predictive loss criterion falls to 
67 140 (with w = 1), and the log pseudo-marginal likelihood and DIC also improve over 
Model B (respectively, —3648 and 9026). The treatment effects are absolutely smaller 
than in Model B, but have similar precision, while the baseline effect is smaller than in 
the first two models. Figure 6.3 shows a more peaked density of patient effects than 
Figure 6.1. 

To introduce the information on clinics into the analysis, one may adopt a form of the 
multi-level growth curve model, as in Equation (6.11). See Program 6.2, Model D, 
which has data input in a different form. Corresponding to the broad decline over time 
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Figure 6.2 Number of clusters (two chains) 
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Figure 6.3 Patient random effects, Model C 


in DBP readings, this model introduces a linear growth effect at clinic level. Intercepts 
and baseline effects also vary at clinic level, and there is an error term additionally at 
patient-clinic level. Thus with j denoting clinic and patient indices i= 1,..n; now 
regarded as being nested within j (so $;n; = 288), the model has the form 


Yin =By Byt Bx By NB Ne Ty ei 
with 
Cit = Vy, t-1  Uijt (6.16) 


and with uj; an unstructured measurement error. The initial conditions ej; have a 
distinct variance term. The clinic effects B,,(k = 1,3) have means and variances 
{P-B &-Bx}. 

Convergence with three chains is achieved after 5000 iterations with the aid of the 
over-relaxation option (there being a delayed convergence in w.B, and w.B3). This model 
confirms a significant linear decline in the blood pressure readings with 95% interval for 
w.B, between —1.48 and —0.62. This model confirms the apparently beneficial effect of 
drug C (Table 6.3). The baseline parameter remains important (averaging 0.40 over 


Table 6.3 Hypertension trial: multi-level model, parameter 
Summaries (iterations 5000-7500) 


Mean St. devn. 2.5% 97.5% 


Np —1.32 0.98 —3.33 0.68 
ne —3.08 0.98 -4.95  —1.21 
w.B, 54.9 5.9 45.7 66.5 

uB —1.03 0.22 —1.48  —0.62 
w.B3 0.40 0.06 0.29 0.50 


y 0.89 0.05 0.79 0.98 
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clinics), though smaller than in the first three models. This form of dependence on the 
initial observation coexists with autoregressive dependence in the errors, with y in 
Equation (6.16) averaging 0.89. 


Example 6.3 Protein content in milk Rahiala (1999) analyses data on the percentage 
protein content y; of cow milk according to diet regime — see also Diggle et al. (1994). 52 
cows were observed for a maximum of 19 weeks, though some were observed for less 
than this — the numbers of weeks vary from 14-19. There are a few missing cases in 
intervening weeks, before the end of the observation period on particular cows; these are 
assumed to be missing at random (Table 6.4). 


One might transform the percentage variables to ensure they are in the appropriate 
range when predictions are transformed back, but the analysis here follows Rahiala in 
treating the data as metric. We also follow Rahiala in fitting a lag 5 model which 
conditions on the first five observations — with non-zero lags assumed at 1, 2, 3 and 5 
weeks, but varying coefficients only at lag 2 and 5 and the mean lag at 5 taken as zero. 
Then with notation as in Equation (6.13) 


Yi ~ NOt 07) 
Vit = Po Pi Yi, t-1 P; Y;, t—-2 P3 Y;, t—-3 Psi Y; 5 PBX 
{Poi Pris Psit ~ N3 [ws Ep] 


where W = (Po, py, 0). The model also includes a single covariate X; for diet type 
(barley=1, 0=lupins) with its coefficient treated as a fixed effect. Note that in BUGS 
the data is provided in a 52 x 19 format even though some cows are only observed for 
14 weeks. 

Because not all the first five values in each animal’s series is observed (e.g. y2, 2 is 
missing), the first five points must still be modelled in some way under the conditional 
approach, even though they are not included in the differential autoregression scheme. 
One might assume a random effects model for the first five periods, with the observations 
Y;, for animal i drawn from a Normal density with variance o2 and means m.yi, 
and the means themselves drawn randomly from a normal density with mean M and 
variance 0%. 


Table 6.4 Variable lag effects, growth curve model 


Mean St. devn. 2.5% 97.5% 


Correlations between random effects 


Between Intercept and lag2 —0.03 0.62 —0.96 0.97 
Between Intercept and lag 5 —0.18 0.60 —0.98 0.92 
Between Lag 2 and Lag 5 —0.69 0.33 —0.99 0.18 


Coefficients (pọ and p, are mean random effects, others are fixed effects) 


Intercept (po) 0.81 0.10 0.58 1.01 
Lag 1 (pı) 0.48 0.04 0.41 0.57 
Lag 2 (p2) 0.08 0.04 0.00 0.16 
Lag 3 (p3) 0.19 0.04 0.12 0.27 


Diet (8) 0.06 0.02 0.01 0.11 
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With a three chain run, convergence is apparent from iteration 10000 and the 
summary is based on iterations 10000-40000. There are clear positive lags at 1 and 3 
weeks, and also a negative correlation between lag 2 and lag 5 growth coefficients. The 
effect of the lupin diet is positive. 

A second model allows the mean coefficient p; to be non-zero. This option shows a 
clear improvement in the predictive loss criterion of footnote 1 (from 183 to 171 when 
w = 1). This model raises pọ to around 1, while p; has mean —0.09 and 95% credible 
interval (—0.15, —0.03). 


6.3 LONGITUDINAL DISCRETE DATA: BINARY, ORDINAL AND 
MULTINOMIAL AND POISSON PANEL DATA 


As discussed in Chapter 3 and elsewhere, discrete outcomes may often be modelled 
in terms of latent continuous variables. Thus, underlying a panel of binary observations 
Yi, we may posit a continuous latent variable or underlying propensity Z;, such 
that Y;, = 1 if Zy is positive, and Yy = 0 if Zy is negative. Following Heckman (1981), 
one may formulate the model for Z; in terms of measured or endogenous effects V; (for 
example, Vy = BX;,, where Xy are known predictors) and a stochastic error €; so that 


Zin = Vu Fit (6.17) 


Then Z; > 0 if and only if Y;; = 1, while Zy < 0 if and only Yy = 0. In a frequently used 
Bayesian approach developed by Albert and Chib (1993), the latent scale may be 
sampled via truncated sampling, with the truncation ranges determined by the observed 
Y;,. It is assumed that the probability of success is expressed as m; = F(), where F(.) isa 
distribution function, and so lies between 0 and 1. So a success occurs according to 


Pr(Yir = 1) = Pr(Zir > 0) = Pr(er > — Vi) = 1 — F(— Vi) 


For forms of F that are symmetric about zero, such as the cumulative distribution 
function of a standard normal variable, the last element of this expression equals F(V;;). 

If the chosen distribution function is the cumulative Normal, then Z may be sampled 
from a truncated normal: truncation is to the right (with ceiling zero) if the observation 
is Y;,= 0, and to the left by zero if Y;,= 1. To approximate a logit link, Z;, can be 
sampled from a Student ¢ density with eight degrees of freedom, since, following Albert 
and Chib (1993), a ¢(8) variable is approximately 0.634 times a logistic variable. This 
sampling based approach to the logit link additionally allows for outlier detection if the 
scale mixture version of the Student ¢ density is used, rather than the direct Student t 
form. The scale mixture option retains truncated Normal sampling but adds a mixture 
variables \;, such that 


Zit ~ N(Bxi, 4; ') IL, U) 


with à; sampled from a Gamma density G(4, 4). The resulting regression coefficients 
need to be scaled from the ¢(8) to the logistic. L = 0 when Y; = 1 and U = 0 when 
Yir =0. 
With Y; still a binary outcome, the Z; in Equation (6.17) may be expressed as 
Zin =PBXir Qi Kr Er (6.18) 


where the k; are interpreted as period effects invariant over individuals (e.g. reflec- 
ting moves in national economic conditions on firm level outcomes), and the a; are 
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permanent effects attached to the subjects. For binary outcomes, restrictions are needed 
for identifiability. Thus, if a constant variance ¢ of the g; is assumed, then it is necessary 
that b = 1 (or possibly some other preset value). Note, however, that time varying 
variances , may be identifiable provided one of them (e.g. ;) is set to a pre-specified 
value. Heckman extended this error structure to include coefficients on the a;. This is 
known as the one factor model 


Zin = BX NO Ky Er (6.19) 


where, if the gy are taken as uncorrelated, the ^; may be used to describe the correlation 
between time points ¢ and s (see Example 6.6). For count outcomes, Dagne (1999) 
proposes a similar model with the loadings either over subjects or times. Thus, 
Yi ~ Poi(p,,), and for loadings varying by subject 


Mj =A; BX Er 


Ordinal panel data Models for ordinal responses over time are important because in 
many settings involving human subjects, classifications are on a graded scale, with 
precise quantification not being possible. Examples include pre- and post-treatment 
observations on rankings of illness symptoms (e.g. no symptoms, mild, definite) or 
changed illness states, as well as survey questions on changing views on controversial 
topics. As discussed in Chapter 3, a continuous scale may often be envisaged to underlie 
the grading, with a series of thresholds 71, 72,..tc_; defining which of C categories a 
subject lies in. Then with Z; taking one of the forms as above, for example 


Zi = Bo BX ei 
we have 


1 if Zi < Ti 

2 if tT) < Zu < T 
Vrm Ea 

C-1 if Tc-2 < Zi <TC] 

C if Zi > Tc-1 


Alternative parameterisations are possible to ensure identifiability: either taking By = 0, 
or 7; = 0, or tc = C ensure that the mean of the Z is identified (Long, 1997, p. 122). The 
variance of Z may be identified by taking £ to be N(0, 1). This leads to the ordinal probit 
model, whereas taking g to be Student 7(8) have variance 1 leads to the ordinal logit 
model. 

In panel setting one might consider shifts in the location of thresholds by making the 
cut-points time specific — for example, if the analysis was intended to assess whether 
there had been a shift in attitudes. Random variations in intercepts between subjects or 
variation in trends across time (random time slopes) may also be modelled in panel 
settings because of the repetition of observations over subjects. 


6.3.1 Beta-binomial mixture for panel data 


Many repeated observations of social choice processes are available as binary series at 
times 1, 2,... 7 and aggregate to binomial series for total subject populations or sub- 
populations of subjects. For example, the events migration, job change or divorce 
are binary though Y; = 1 may sometimes include more than one event. As Davies 


LONGITUDINAL DISCRETE DATA: BINARY, ORDINAL AND MULTINOMIAL 245 


et al. (1982) point out, such time series may show three kinds of systematic variation: 
heterogeneity in event probabilities across individuals or sub-populations, non- 
stationarity in the event rates at individual or aggregate level or both, and event history 
effects. The latter are exemplified by first order effects, when the state occupied at time 
t— 1 influences the state occupied at time ¢. The factors underlying these choice 
sequences may be heterogeneous over sub-populations, as well as varying over time. 

Suppose the binary series are of length T, with 27 possible series being potentially 
observable, For instance, if T = 3, one may observe the sequences {1,1,1}, {1,1,0}, 
{1,0,1}, {0,1,1}, {1,0,0}, {0,1,0}, {0,0,1}, {0,0,0}. One may observe population sub- 
groups classified by demographic and social attributes which differ in terms of their 
distribution over these patterns. For example, if the event were migration or purchasing 
trips then groups with high event probabilities will be concentrated in sequences with 
higher occurrences of ones, e.g. higher proportions of sequences such as {1,1,1} or 
{1,0,1}. This is the heterogeneity issue. There may also be non-stationarity, if for 
example the sequence {1,0,1} is more frequent than the sequence {1,1,0} because the 
event rate is higher at time 3 than at time 2. 

Let px; denote the aggregate event rate at time ¢ for subgroup k. Often heterogeneity is 
modelled within a logit or other link, so that if y,, denotes the number undergoing the 
event at time ¢ from ny at risk, then one might stipulate 


Yke ~ Bin( Pre, Nk) 

logit( Pk) =a: bk ek 
where the ex; are Normal. However, heterogeneity in discrete outcomes over time might 
also be modelled in terms of conjugate mixture densities. Thus, heterogeneity in event 
probabilities p over individuals or sub-populations may be represented by a beta 
mixture g(p). In the case of variations at individual level, the beta parameters may be 
linked to selected characteristics of individuals. 

Suppose we have a stationary series with the rates p; at different times ¢ being 

constant, p; = p. Then the probability of different sequences such as {1,1,0} or {1,1,1} 
may be modelled in terms of the moments 


1 
W = Elp’] = |, eo $e. (6.20) 


For a series of length T, Bi; is equivalent to the probability of exactly j events occurring — 
regardless of their sequencing within the T points. For example, if T = 3, the probabil- 
ity of an event at every point in the series may be denoted 


1 
TH = | (p.p-p)g(p)dp = ELP] = p; 


and the probability of the sequences {1,1,0}, {0,1,1}, and {1,0,1} are equal since 
1 
mo = | p-p: 0 -pgod = Elp? ~ p°1= p = wh 
1 
Tou = | (C= p): p: p)gp)dp = EIP — p°] = m, — u5 


1 
T101 = j (p-(1—p)-p)g(p)dp = Elp? — p°] = py — ps 
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If, however, the chance of the event is varying over time (i.e. there is non-stationarity), 
with pı, p2,..pr possibly different, this model has to be generalised. Suppose the form 
of heterogeneity is fixed and, taking the first period as reference, may be denoted g(p1). 
A simple option to account for non-stationarity is linear scaling of event probabilities 


P= oP 


This ‘scaling’ is somewhat analogous to using a piecewise exponential in hazard regres- 
sion, with the ò modelling the fluctuations in the p, without necessarily presupposing a 
parametric form. Then the chance of the series {1,1,1} would be 


1 
T111 =| (pı - Sipi : 82pi)g(pi)ap 
0 


The likelihood for T = 3 consists of Nj; persons with pattern {1,1,1}, M110 persons 
with pattern {1,1,0}, etc. If the chance of the event is low, then linear scaling may be 
appropriate even though in general linear scaling may lead to rates exceeding one — 
unless the maximum rate (rather than the first rate) is taken as a reference, and all 8, are 
then under one. 

However, one may also specify logistic scaling such that there is appropriate {0,1} 
bounding of probabilities. Thus, if successive odds ratios are linked according to 


[p:/( — pd] = KL pi /U — pid] 


then successive logits are linked as 
logit( pr) = log (k,)logit(p1) 


Davies et al. adopt the parameterisation 


P:/ (A = pd) = A/K)P1/C = p) (6.21) 
so that 
Pr=%pi t>1 
where 
9,=1/(ki—Kp1 Pi) (6.22) 


Similar models may be developed for Poisson series. 


Example 6.4 Migration histories As an example of a conjugate mixture approach, 
consider data from Crouchley et al. (1982) on ten year migration histories for 10000 
residents of Wisconsin, USA, disaggregated into k = 1,..4 sub-populations defined 
by two age bands and two household tenures (Table 6.5). Sequences of urban migra- 
tions are defined by five two year periods (T = 5), with MMMMM the same as 
{1,1,1,1,1} above while MMMMS denotes four migrations followed by ‘staying’ or 
non-migration. 


Considerations of the housing market and lifecycle migration processes indicate that 
the parameters 0x: in Equation (6.22) are likely to vary both over sub-populations (e.g. 
younger people are more likely to migrate) and over time. 

There are i = 1,..,32 possible migration sequences of five outcomes (migrate or stay), 
so that the likelihood is defined over 32 x 5 x 4 points. Let dy = 1 ifa migration occurs at 
period ¢ (e.g. da = 1, dn = 1, di = 0, di4 = 1, dis = 1 for the sequence MMSMM) and 
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Table 6.5 Inter-urban moves over five two-year 
periods in Milwaukee, Wisconsin, USA 


Renters Owners Move Pattern 


(M=Migrate, 

25-44 46-64 25-44 46-64 S=Not Migrate) 
511 573 739 2385 SSSSS 

222; 125 308 222 SSSSM 

146 103 294 232 SSSMS 

89 30 87 17 SSSMM 

90 77 317 343 SSMSS 

43 24 51 22 SSMSM 

27 16 62 19 SSMMS 

28 6 38 5 SSMMM 


52 65 250 250 SMSSS 
17 20 48 14 SMSSM 
26 19 60 25 SMSMS 
8 4 10 3 SMSMM 
8 9 54 21 SMMSS 
11 3 18 1 SMMSM 
10 3 21 1 SMMMS 
4 1 8 2 SMMMM 
41 29 134 229 MSSSS 
16 15 23 10 MSSSM 
19 13 36 25 MSSMS 
2 4 1 0 MSSMM 
11 10 69 24 MSMSS 
11 2 15 3 MSMSM 
1 9 13 2 MSMMS 
2 2 2 0 MSMMM 
T 5 40 18 MMSSS 
4 2 9 2 MMSSM 
8 1 15 3 MMSMS 
1 0 5 0 MMSMM 
8 1 22 7 MMMSS 
3 2 7 2 MMMSM 
5 0 9 2 MMMMS 
6 3 5 0 MMMMM 


nix be the number of persons of type k undergoing that sequence. Then the likelihood for 
sequence i, time t and sub-population k is defined as 


Vike = diik 

Vie ~ Bin(Ver, Nik) 
As Table 6.5 shows, the event rate is low. However, one may apply the odds ratio scaling 
model in Equations (6.21)-(6.22), so that for the Ath group the model for pg, combines 
the updating parameters 0;;(¢ > 1) with the initial period baseline migration rates px; 


which are assigned a mixing density g(p,1). Thus a beta prior might be assumed for ppi, 
and then 


logit(px:) = logit(pi1) — log (K) 
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where «K; is as in Equation (6.21). Gamma G(1, 0.001) priors are adopted for the «z. 

The estimates* of 0x, (for k = 1, 4 and t > 1) for the four sub-populations are as in 
Table 6.6. These show greatest non-stationarity for the young renters sub-population. 

The elements needed to obtain the variances of migration rates in the groups, namely 
W and «4, may be monitored, as may the chances of different sequences (allowing for 
non-stationarity). The variances are then obtained from the posterior means of p>, and 
Mig AS Way — (Wig). The variances are higher in the renter groups, reflecting a well 
known feature of migration behaviour. As an example of a sequence probability 
Program 6.4 includes the probability of the sequence MMMMS by group. 


Example 6.5 Ratings of schizophrenia To illustrate ordinal outcomes y;; over time, we 
follow Hedecker and Gibbons (1994) in considering data on the impacts of alternative 
drug treatments on symptom severity rankings in 324 schizophrenic patients, collected 
as part of the NIMH Schizophrenic Collaborative Study. There were originally four 


Table 6.6 Panel migrations, parameter summary 


Ok, ; Parameters Mean St. devn. 2.5% 97.5% 


(by subpopulation & time) 

O12 1.18 0.13 0.96 1.44 
013 1.79 0.17 1.47 2.15 
O14 2.54 0.23 2.12 3.03 

O15 3.11 0.27 2.62 3.68 

022 1.34 0.17 1.04 1.72 

023 1.63 0.19 1.30 2.05 
O24 2.08 0.24 1.67 2.61 

025 2.37 0.26 1.89 2.93 

032 1.42 0.08 1.26 1.59 

033 1.73 0.10 1.56 1.93 

034 1.62 0.09 1.46 1.82 

035 1.55 0.09 1.38 1.72 
042 1.06 0.08 0.91 1.22 

043 1.37 0.09 1.18 1.56 
044 1.01 0.07 0.87 1.16 

845 0.91 0.07 0.78 1.05 
Wi, k 

Mid 0.10480 0.00815 0.08928 0.12170 
Myo 0.08775 0.00813 0.07202 0.10390 
M43 0.14820 0.00666 0.13530 0.16130 
Uig 0.08537 0.00457 0.07698 0.09466 
Bok 

bo 0.01104 0.00173 0.00797 0.01480 
W22 0.00777 0.00143 0.00519 0.01079 
W23 0.02201 0.00198 0.01831 0.02602 
u34 0.00731 0.00078 0.00593 0.00896 


4 A two chain run shows convergence at under 500 iterations. 


LONGITUDINAL DISCRETE DATA: BINARY, ORDINAL AND MULTINOMIAL 249 


treatments: chloropromazine, fluphenazine, thioridazine, and placebo. Since previous 
analysis revealed similar effects of the three anti-psychotic drugs, the treatment is 
reduced to a binary comparison of any drug vs. placebo. The severity score is derived 
form item 79 of the Inpatient Multidimensional Psychiatric Scale. 


Hedecker and Gibbons collapse the seven points of that item into four: (1) normal or 
borderline mental illness; (2) mild illness; (3) marked illness; and (4) severely ill. Here we 
retain the full scale with 1=normal, 2=borderline, 3=mildly ill, 4=moderately ill, 
5=markedly ill, 6=severely ill and 7=extremely ill. More pronounced gaps might be 
anticipated in thresholds between some of these categories (e.g. 5 vs. 4) than others. 
There are three repetitions after the first (week 0) reading, which is coincident with 
treatment, namely at weeks 1, 3 and 6. 

The likelihood model is then, for subjects i= 1,.. N and times t= 1,...T and C = 7 
levels on the outcome 


Vir ~ Categorical(pi,, 1:0) 

logitQ;,;, j = Tj — Wir 

Pi,t, 1 = Qi,1, 1 

Pig, ) = Qin — Qij f=2,.....C-1 
Pi,t, C= Ts Qi,t, C-1 

Wi, = BXit 


The predictors Xx, some time varying, include time itself (specifically square root of 
weeks vy), the main treatment effect, a treatment by time interaction and the sex of the 
patient. Two models are considered (Models A and B in Program 6.5), one with a fixed 
intercept and impact of time; the other involves a bivariate model for varying intercepts 
and slopes on time 


W= ba bavi? BX BX BM? 
where 
bi ~ Na(Wp, Xb) 


the means of bip are the intercept and average growth rate, X; is treatment and X is 
gender. Identification of the C — 1 = 6 thresholds involves setting the parameter tı 
(governing the transition from normality to borderline illness) to zero and estimating 
the remaining five threshold parameters subject to monotonicity constraints. 

Table 6.7 shows for the fixed effects model’ a clear time effect, an inconclusive main 
drug effect, but a greater improvement over time for the drug group. The same effects 
show under random effects, but all coefficients are amplified. The DIC under the two 
models are 4121 and 3444, so there is clear evidence of heterogeneity. Under both 
models there seems to be a major change in threshold values for severely vs. extremely 
ill, but the thresholds are larger under random effects — the latter feature is also apparent 
in Table 3 of Hedeker and Gibbons (1994). 

As usual, one might assess sensitivity by assuming alternate forms of random effects, 
as in Example 6.2. The Exercises include fitting a scale mixture version of the random 


5 With three chains convergence is apparent at 500 and 1500 iterations under the fixed effects and random 
effects models respectively; summaries are based on iterations 500-3000 and 1500-3000, respectively. 
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Table 6.7 Schizophrenia severity ratings (posterior 
summaries) intercept and time effects fixed 


Mean St. devn. 2.5% 97.5% 


Constant 4.69 0.24 4.21 5.16 
Time —0.54 0.12 —0.77 —0.31 
Drug 0.175 0.207 —0.216 0.582 
Male 0.173 0.098 —0.018 0.365 
Drug by Time —0.72 0.13 —0.98 —0.46 
T2 1.27 0.12 1.06 1.51 
T3 2.26 0.13 2.02 2.53 
T4 3.40 0.14 3.13 3.69 
T5 4.94 0.16 4.65 5.26 
T6 T:52 0.22 7.10 7.97 


Intercept and time effects random 


Constant (Av.) 7.66 0.40 7.02 8.43 
Time (Av.) —0.85 0.16 —1.20 —0.56 
Drug 0.35 0.31 —0.33 0.93 
Male 0.18 0.24 —0.26 0.67 
Drug by Time —1.19 0.19 —1.53 —0.77 
T2 2.05 0.18 1.70 2.40 
T3 3.72 0.21 3.30 4.11 
T4 5.64 0.24 5:12 6.07 
T5 8.16 0.29 7.54 8.68 
T6 11.90 0.41 11.04 12.65 


Dispersion matrix 


Xp 3.68 0.73 2.52 5.23 
Xpi2 —0.62 0.30 —1.29 —0.10 


Xp22 1.18 0.25 0.73 1.69 


effects model (equivalent to a multivariate t, and with degrees of freedom an unknown); 
the DIC suggests a slight gain in fit for this model compared to the simple multivariate 
Normal. 


Example 6.6 Rheumatoid arthritis Additional issues arising from unequal spacing 
and missing data are illustrated by binary series Y; from Fitzmaurice and Lipsitz 
(1995) following a randomised trial of therapies for i= 1,..,51 patients with classic 
or definite rheumatoid arthritis. The treatments were auranofin and a placebo, and the 
binary response measured at baseline (t = 1) and four times thereafter (t = 2, 3, 4, 5) is 
a patient self-assessment of arthritis status (1=good, 0=poor). Some self-assessments 
(17 from 255) are missing, and Fitzmaurice and Lipsitz adopt an MCAR assumption to 
these missing responses (i.e missingness is unrelated both to any observed responses and 
covariates or to the missing outcomes themselves). The gap between the baseline and 
first post-treatment assessment is one week, with succeeding gaps being four weeks (i.e. 
the five readings are at 0, 1, 5, 9 and 13 weeks). The predictors of status are time, 
treatment, age at baseline and gender. 
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Fitzmaurice and Lipsitz adopt a logit link 
Yi, ~ Bern(m;) 
logit(mi) = Oi 
and consider models for the correlation between Y;, and Yy (t Æ s), defined by 
Disi = (Tis — TiTi) [T — mim — m) 


where Ts; is the joint probability’ of a good status at times s and ¢. An exponential 
model for the correlation takes the form 


Pist = al 


where 0 < 8 < 1. They in fact adopt a reformulation in terms of the odds ratios of good 
status at times s and t: 


ist — Tist(1 — Tis — Tit Tist) /[(Tis — Tist)(Ti — Tist) ] 
modelled as 
Pir = a/l 


with 1 < a < oo. 
The first model adopted here allows for autocorrelation via lag one autoregressive 
errors in the logit scale, with 


Yi, ~ Bern(m;) (6.23a) 


logit(t,) = Or = BXir ei (6.23b) 


Allowing for the differential spacing (in terms of time in weeks) between observations 
t — l and ¢ gives 


en = yr rte; i ug t>l (6.230) 
where var(u) = o?. Since one may write 
ei = Yei Uil 
and var(e;2) = var(e;), one may specify 
en ~ NO, o°/(1 — 3°) 


Fitzmaurice and Lipsitz assume y is between 0 and 1, and an informative prior 
constraining y to be positive is adopted here, namely a uniform prior 


y~ UC, 1) 


For identifiability, the variance of u is set at o? = 1. The missing responses for certain 
subjects are taken as missing at random, that is possibly depending on observed 
responses and covariates for these subjects, but not on the missing responses themselves. 


© One might obtain an estimate of this correlation by defining a vector of length 4 for each of the T;(T; — 1)/2 
pairs of binary observations for the ith subject, with the first item of the vector being 1 if both y;, and yy are 1, 
the second being | if y;s alone is 1, the third being 1 if y;, alone is 1, and the fourth being 1 if y;, and yis are both 
0. Then multinomial sampling would be used, with N series of T; observations being reformed as 
NT;(T; — 1)/2 multinomial observations of length 4. 
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We obtain’ the estimates shown in Table 6.8 for time, female gender, age and 
the therapy. These compare to the Fitzmaurice and Lipsitz estimates for the effects of 
these covariates of —0.013 (s.e. 0.022), —0.61 (0.42), —0.015 (0.018) and 1.45 (0.45). The 
predictive loss criterion (with w = 1) is 46. The high correlation replicates the work of 
Fitzmaurice and Lipsitz, who found a strong odds ratio form of dependence between 
successive patient self-ratings. 

An alternative approach is provided by the ‘one factor model’ of Heckman (1981, 
p. 130) namely 


Or = BX Ao; Ex 


where the sx are unstructured errors with variances ,, the a; are latent propensities 
for good health status with variance oł, and the à; are time-varying loadings. For 
identifiability, it is assumed that \; = 1, and pọ, = @ = 1. The variance o2 is defined by a 
U(0, 1) prior on 


t=07/[02 4] 
Defining the ratios 
R=Noz/DMez 41 


of ‘permanent’ to total variance’, the correlation between disturbances at times ¢ and s is 
then C,, = AAs, where 


Ay = R” 


N(1, 1) priors are assumed on the \,(¢ > 1), and constrained to being positive. 

Fitting this model? enhances the mean therapy effect to 2.2, with 95% interval 
{0.75, 4.1}, but shows weaker effects than in Table 6.8 for gender or age. The correl- 
ations range between 0.47 (between times 3 and 4) and 0.74 (between times 1 and 2). The 
individual propensities a; range from 3.3 for patient 4, who records good health status 
at all time points, despite being on the placebo treatment, to patient 32 with a score of 
—3.8 and classed as poor status at all points. The predictive loss criterion stands at 45, 
showing slight gain to using this model. Subject to identifiability more complex one 
factor models might improve fit, for example taking , variable by time. 

Finally, the method of Albert and Chib (1993) offers a direct approach to sampl- 
ing the underlying latent variables in Equation (6.17). It is applied to the model of 


Table 6.8 Arthritis status, differential spacing model, 
parameter summary 


Mean St. devn. 2.50% 97.50% 


Intercept 1.86 1.68 —1.5 5.17 
Time 0.022 0.05 —0.075 0.12 
Female —1.14 0.84 —2.82 0.49 
Age —0.017 0.034 —0.085 0.052 
Therapy 1.82 0.79 0.39 3.48 
y 0.93 0.03 0.86 0.97 


7 A three chain run shows convergence at 1000 iterations and summaries are based on iterations 1000-5000. 
3 The composite error Aya; multiplies the permanent subject error by a time varying factor. 
°? A two chain run is taken to 5000 iterations with 1000 burn-in. 
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Equation (6.23) but might equally be applied to the one factor model. In the presence of 
missing observations, one option is to use imputations of the missing binary responses 
Yiu, SO permitting one to define the sampling ranges for the latent continuous 
indicators Z;;. For illustration a single drawing of the missing Yj; is used (at 5000 
iterations in the run for Model A). A more complete multiple imputation procedure 
would pool over several analyses with different imputations. An alternative approach if 
there are missing data points is to use unconstrained sampling of the Z for those points. 
One might then estimate probabilities such as Pr(Z;, > 0) = Pr( Yi, = 1) for the missing 
observations. 

Here the latent sampling approach is applied with Student ¢(8) errors, so approxi- 
mating the logit link. This entails defining gamma variables »;, with density G(4, 4) and 
then sampling (within the relevant constraints) the Z; with means 0; = BX; ej and 
precisions ¢,,. One might also take the precisions at subject level ọ;, and not observation 
level. Model fit may be assessed by sampling new Z values without upper and lower 
constraints, and then classifying new Y values according to whether the new Z exceed 0; 
on this basis, the predictive criteria of Gelfand and Ghosh (1998) and others may then 
be derived, comparing actual and new Y. 

Under the imputation approach”®, the correlation under differential spacing is 0.91 
(see Table 6.9). This approach shows a lower precision on the therapy effect (though it is 
still judged beneficial), and the gender effect is amplified as compared with the usual 
logit model in Equation (6.23). The predictive loss criterion under this model is 39.4. 
Under unrestricted sampling'! of Z;, when there are missing Y; (Model D in Program 
6.6), the posterior mean for y is 0.89 and the treatment effect averages 2.1 with 95% 
interval (0.4, 4). For predicting missing values under this approach the model for Zi 
could well be improved, for instance using autoregression on previous values of Z; or of 
Yx, particularly since the first occurrence of missing Y is at t = 3. 


Example 6.7 Patent applications As an illustration of generalised Poisson regression 
models for panel data on counts yi, we follow Hausman, Hall and Griliches (1984) and 
Cameron and Trivedi (1999) in considering patent applications by technology firms. 
Trends in patent activity may be partly explained in terms of levels of current and past 
research inputs, type of firm, and time t itself. However, unobserved variation is likely 
to remain between firms in terms of factors such as entrepreneurial and technical skills — 
suggesting the need for a permanent firm effect. Autocorrelation in residuals is reported 
by Hausman et al. in a simpler model without such unobserved factors included, and by 


Table 6.9 Latent student ¢ model, posterior summary 


Mean St. devn. 2.5% 97.5% 
Intercept 3.61 2.15 —0.17 8.00 
Time —0.013 0.059 —0.126 0.104 
Female —1.48 1.12 —4.00 0.58 
Age —0.04 0.04 —0.12 0.03 
Therapy 2.38 1.00 0.43 4.39 
y 0.91 0.03 0.85 0.95 


10 Convergence with three chains is apparent at 3000 iterations (with delay in the age effect converging); 
summaries are based on iterations 3000-10 000. 
11 The sampling intervals for Z are (—20, 0) when Y is 0, (0, 20) when Y is 1, and (— 20, 20) when Y is missing. 
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modelling this type of heterogeneity the need for a serial correlation model for the errors 
may be avoided. 


Hausman et al. adopt a multiplicative conjugate gamma form for the heterogeneity, 
such that 


vir ~ Poi(vipiz) (6.24a) 
v; ~ G, 8) (6.24b) 
log (Wiz) = Bxit (6.24c) 


For this specification, the marginal means and variances are of negative binomial form 
with Eu) = py and Vivi) = wy = Var(v) ei = bi H7,/8. 

The analysis here follows Hausman et al. in adopting a random effects specification 
for firms, in addition to a regression model including time ¢ and logs of current and 
lagged research expenditures Ro, R_;,...R_s5. Note that the original inputs have been 
centred in Program 6.7 to improve convergence. As noted by Cameron and Trivedi 
(1998, p. 286), the coefficients on log (R_;) are elasticities, so that the coefficients on 
log Ro, log R_1,.. log Rs, would sum to unity if a 50% increase in patents filed follows 
a 50% increase in research spending (i.e. a proportional effect). 

The first model used here has the form has an additive (log-normal) form for the 
extra-variation between firms. 


log (Wiz) = BXir a; 
Vit ~ Poilu) 
Q; ~ N(0, T) 


where a; are interpretable as permanent firm effects. The covariates are the research lags 
and time itself. The estimated!? sum of the research elasticities (Table 6.10) comes close 
to Table III of Hausman et al. The return to research spending in patent output is not 
proportional, though higher than reported by Cameron and Trivedi (1998, p. 286). 

The second model follows Hausman et al. in taking a conjugate gamma-Poisson 
model (Model B in Program 6.7), as in Equation (6.24), with 


v; ~ G(8, 8) 
è~ Gil, 1) 


where sensitivity to the prior for 6 might involve gamma priors G(g, g) with alternative 
values g. The log (v;) are analogous to the a; of the Poisson log-normal model. Note that 
a diffuse prior on 6 (e.g. with g = 0.001) may lead to numerical problems. This 
approach gives broadly similar results to the preceding model, and also shows a less 
than proportional research input gain. The DIC and predictive loss criterion are both 
slightly improved under the gamma-Poisson model; the predictive loss (with w = 1) falls 
from 204 500 to 202 400, and the DIC from 9760 to 9650. 

To generalise to bivariate effects, at firm level the Poisson log-normal model may be 
preferable. Following Chib et al. (1998), the slope on log Ro is taken to be variable over 
firms, together with the intercept. Rather than assuming zero means for these param- 
eters (denoted a, 1:2), and retaining separate ‘fixed effects’ for the intercept and the 


12 Tn the first and second models, three chain runs show convergence at around 3000 iterations and summaries 
are based on iterations 3000-5000. 
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Table 6.10 Models for firm heterogeneity: regression parameters and estimated firm 
effects (Firms 1—4) 


Mean St. devn. 2.5% Median 97.5% 


Log-Normal Heterogeneity 


Intercept 1.02 0.07 0.89 1.02 1.16 
Sum of Coefficients on Rj 0.80 0.02 0.75 0.79 0.85 


Coefficient on 


Ro 0.49 0.04 0.41 0.49 0.57 
Ri —0.03 0.05 —0.12 —0.03 0.06 
R2 0.10 0.04 0.02 0.10 0.18 
R3 0.12 0.04 0.04 0.12 0.20 
R4 0.04 0.04 —0.03 0.04 0.11 
Rs 0.08 0.03 0.02 0.08 0.14 
Time —0.065 0.003 —0.071 —0.065  —0.059 


Effects for firms 1—4 


ar 2.34 0.08 2.18 2.34 2.50 
az 0.17 0.35 —0.55 0.19 0.81 
a3 0.36 0.10 0.16 0.36 0.54 
a4 —1.86 0.53 —2.96 —1.82 —0.91 


Gamma-Poisson heterogeneity 


òi 1.10 0.11 0.92 1.10 1.33 
Sum of Coefficients on Rj 0.76 0.06 0.67 0.75 0.84 
Intercept 2.46 0.06 2.35 2.46 2.58 


Coefficient on 


Ro 0.48 0.14 0.28 0.50 0.68 
Ri —0.07 0.10 —0.23 —0.08 0.12 
R2 0.11 0.05 0.03 0.08 0.20 
R3 0.12 0.05 0.05 0.11 0.22 
R4 0.03 0.05 —0.04 0.01 0.15 
Rs 0.08 0.06 —0.02 0.08 0.17 
Time —0.064 0.005 —0.072  —0.064 —0.054 


Effects for firms 1-4 (logs of gamma effects) 


log(v1) 1.90 0.07 1.77 1.90 2.04 
log(v2) —0.19 0.37 —0.96 —0.17 0.50 
log(v3) —0.03 0.13 —0.25 —0.04 0.22 
log(v4) —2.75 0.83 —4.68 —2.62 —1.48 


coefficient on log Ro, it may be preferable for MCMC identifiability and convergence to 
take aj, 1.2 to be bivariate with mean n, where n; is equivalent to the fixed effect 
intercept By and n, to the fixed effect B,. So 


log(wi) =a anlog Ri, Bylog(Ri,i-1)  .-Belog(Ri, 5) 
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To further improve identifiability an N(1, 1) prior on n, is adopted, and N(0, 1) priors 
on Bz, B3, Ba, Bs and Be. This model yields’? an improved fit with DIC at 9105 and 
predictive loss criterion at 147000. Its improvement occurs especially for firms with 
low (e.g. runs of zeroes) patent counts such as 13 and 21, and firms with high patent 
activity, such as firms 45 and 242. Under this model, the coefficient on log Rọ increases 
to around 0.64, with the sum of elasticities around 0.89. It is interesting to note a 
correlation of —0.76 between the firm specific slopes and intercepts. 

None of the above models directly approaches the issue of serial correlation in the 
errors, though this is modest in the bivariate firm effects model'*. A fourth model 
(Model D in Program 6.7) therefore assesses autocorrelation under the Poisson log- 
normal model, specifically a lag 1 dependence in the errors. There are various options 
here, with a common choice being based on stationarity 


log (Ma) = BXir Qi en (6.25a) 
en ~ N(yei, 1, 0°) t>1 (6.25b) 
en ~ NO, o7/(1 -y (6.25c) 


However, in the short runs (small T) typical of panel data, a non-stationary model is 
feasible, with the variance of the first error a separate parameter not linked to o”, and y 
following a prior such as N(0, 1). A transformation within the link for u, can also be 
made, so that for t > 1 


log (Win) = Bxir oF yLlog (m; 1) — Bx, 1 — Ai] Uir 


where uj; ~ N(0, o°). One might also assume a lag 1 correlation in an estimate of the 
regression errors based on comparing actual and predicted counts. This approach uses 
the transformation (Cameron and Trivedi, 1998) 


Yi=Vin A,h>O 
so that 


log (Wa) = Bxir a; yog, 1 — BXi, 1 — a) 


with A either set to a small constant or forming an extra parameter. 

Another issue with including observation level heterogeneity is that the ey 
may effectively model the observations, making the regression terms in effect redun- 
dant. Unlike for binary panel data, it is not necessary to set o? to 1. However, it may 
be advisable to set a prior favouring relatively high precision in the ex (for a discus- 
sion of similar questions in modelling binomial heterogeneity, see Valen and Albert, 
1998). Also, a high lag 1 correlation may make separate identification of e; and a; 
difficult. One may also adopt the strategy of linking these two effects, as discussed 
above. 

Exploratory analysis with model (6.25) suggested that there was a high lag coefficient 
close to unity, whether non-stationary or stationary priors on y were used. To improve 
identifiability, the permanent firm effects a; are therefore excluded. Also, a relatively 


13 A two chain run shows convergence at around 5000 iterations and inferences are based on iterations 5000- 
7500. 

14 Defining errors by firm and period as (yi, — Min)/ Mi gives a correlation between errors at lag 1 and at lag 2 in 
this model of 0.12, compared to 0.29 in the Poisson-gamma model. 
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informative G(5, 1) prior on 1/o? was taken together with N(1, 1) priors on B4 (the 
coefficient on log Ro) and N(0, 1) priors on B», B3, B4, Bs and Be- 

Under a stationary prior for y in the model (6.25), a two chain run (with convergence 
from iteration 5000 in a total of 7500) shows a sum of elasticities more in line with 
proportional returns, in contrast to the models without temporal error correlation 
(Table 6.11). The posterior mean for the sum of elasticities is 0.9, though the 95% 
upper point does not exceed 1. The posterior mean for 1/o7 is 21.5 compared to the 
prior mean of 5 so that the prior does not seem incompatible with the data. The 95% 
credible interval (0.97, 0.98) for y comes close to a random walk. The DIC and 
predictive loss fit measures both show a gain from using this model; the DIC is 8590 
compared to 9650 under the Poisson-gamma model for permanent firm effects. 

A variety of other modelling approaches may be applied to these data, including 
variable selection methods (e.g. Kuo and Mallick, 1998), since some of the lag coeffi- 
cients may be unnecessary, or priors specifically adapted to distributed lags (Shiller, 
1974). 


6.4 PANELS FOR FORECASTING 


Panel data in econometrics, demography and biometrics reinforce information on trend 
by repetition over subjects. Such subjects vary by setting, and might be patients or 
pupils (growth curves), regions or firms (differences in productivity growth), or age 
groups (demographic schedules through time). While evaluation of time series models 
usually involves cross-validatory assessment in time (e.g. via one step ahead forecasts), a 
broader set of checks are possible in panel data analysis. Suppose there are N subjects 
observed for T times. Granger and Huang (1997) consider out-of-sample prediction 
in econometric panel models, and provide a useful framework distinguishing between 
out-of-sample predictions (for subjects or firms N 1,N 2,..N M not in the 
sample), post-sample predictions (for times T 1,7 2,... beyond the sample) and 
predictions both post and out-of-sample (for both times t= 1, T 1,..and subjects 
i=N 1,N 2,..). 


Table 6.11 Temporal correlation model, parameter summary 


Mean St. devn. 2.5% 97.5% 

Sum of Coefficients on Rj 0.90 0.03 0.84 0.95 
Coefficient on 

Ro 0.44 0.03 0.36 0.51 
Ry 0.001 0.07 —0.10 0.13 
Ro 0.11 0.05 0.00 0.21 
R3 0.12 0.05 0.04 0.22 
R4 0.14 0.04 0.06 0.22 
R5 0.09 0.04 0.02 0.16 
Time —0.06 0.01 —0.07 —0.05 


Error correlation 


y 0.98 0.003 0.97 0.98 
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As a particular frequent application, growth curves raise similar issues of prediction 
both beyond the observed time points, and also in terms of incomplete trajectories for 
subsets of subjects. We might also have observed cohort c for T time points and want to 
use their growth profile parameters to predict the future growth of a younger cohort 
c 1 observed for fewer than T times. In the balanced growth curve case where equal 
numbers of times are observed for each subject, there are observations for N subjects at 
T times points vj, v2, v3,... vr, forming a T x N matrix Y. Let X be a T x m design 
matrix, where m is the degree of the growth curve polynomial. Thus for linear growth, 
m = 2, and 


In the random effects case, 
Ya = Ba Bor Er 


the intercepts B, and linear growth terms Bp vary over subjects i, for example via a 
bivariate density, and the e may be unstructured or correlated. 

Other common options involve the growth parameters B being taken as homogenous 
over G subgroups of the subjects, for example defined by gender or assigned treatment. 
Let A be an assignment matrix of order G x N, with Aj = 1 if subject i is in group j. For 
instance, if there were gender subgroups of size N; and N2, then A contains N; columns 
with entries 

1 

0 
and N> columns of entries 

0 

1. 

Then B will be of dimension m x G, and for m = 2 would contain group specific 
intercepts and linear growth parameters. The growth curve model in this case may then 
be specified as 


Y=XBA y (6.26) 


where n is a T x N error matrix. 

The specification of n may be important if the goal is extended prediction for all N 
subjects to new times vr 1, vr 2, etc. outside the current growth curve. Other types of 
prediction involve projecting incomplete trajectories for certain subjects observed only 
for S < T time points. 

Thus, an unstructured error matrix with 


cov(n; 1:7) = >> 


might be adequate to describe a growth pattern (e.g. a clinical trial of known duration), 
where extrapolation beyond the trial is not an issue. However, structured modelling of 
the n; may be required if post-sample prediction is the goal (or autoregression in the Y 
themselves). Thus one might adopt an AR(1) structure with 


Ne = VN, 1 Ue j>l 
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where u; are white noise errors with mean 0 and variance o?. For a growth curve with 
m = 2 and homogenous effects within subgroups, the model for the first time point in 
the AR(1) model has mean defined by the group index a;e{1,.. G} 


Ma = Bija] Boja; Y1 Nit 


where n; has variance o?/(1 — y’). 

The question of optimal prediction then occurs. Standard discrepancy measures 
comparing predicted and actual trajectories may be used if actual values are known. 
Thus, Lee and Geisser (1996) consider extending the sample re-use or predictive cross- 
validation method which for cross-sectional data involves omitting each case at a time 
and predicting it on the basis of the remaining N — 1 cases. The ‘re-use’ occurs because 
each observation is used N — | times in separate models. In the panel data context, with 
T repeated observations for the N subjects, a sample re-use procedure might be used to 
predict the final (i.e Tth) observation in the series for a single subject i on the basis of 
that subject’s prior series data, and the full series for the remaining N — 1 subjects. This 
procedure would be repeated N times. 


Example 6.8 Dental development data We consider the dental data from Potthof 
and Roy (1964) relating to 27 subjects, 11 girls (a; = 1) and 16 boys (a; = 2). In the 
full data set there are T = 4 observations on each subject, these being distances in mm 
from the centre of the pituitary to the pteryomaxillary fissure at two yearly intervals (so 
that centred times are vı = —3, v2 = —1, v3 = 1, v4 = 3). Following Lee and Geisser 
(1996), observation 20 is omitted as an outlier. Further the original data are modified so 
that the last female vector has its fourth observation, namely 28, missing: so T\; = 3, 
and the first three observations for that subject are (24.5, 25, 28). Also the first male 
vector (with first three observations 26, 25 and 29) is regarded as having the fourth 
observation, namely 31, missing. 


Prediction of the missing data is carried out under three models: first, a linear growth 
curve model with gender specific intercepts and slopes (cf. Equation (6.26)) and uncor- 
related errors. Thus, 


Yin = Bi, Qj Bo, a; ¥t Eir 


with ¢ multivariate Normal with mean 0 and dispersion matrix $. The second is a 
growth curve model with both unstructured univariate Normal errors ¢ and autocorre- 
lated errors e; (lag of order 1). Thus, for ¢ > 1 


Yr = By, ai Bo, a;Vt Cit Eit 


where e; = Yei, ;-1 Uin Where the u are unstructured. An N(0,1) prior is assumed for y. 
The third model involves a first order lag autoregression in the observations themselves. 
The means for the third model are of the form (for t > 1) 


Bit = Ba;  Pa,Vi, -1 
For the first period alternative means are assumed, with 
Mi = BF 
Another model for t = 1 might retain the same intercept and model the latent yj in 


bia = Ba, Pa,Vi0 
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Table 6.12 Dental development data, alternative structures for data and 
errors, model parameters and predictions of missing data points 


1. Unstructured errors 


Correlations for times Mean St. devn. 2.5% 97.5% 
@ 1,9 

(2, 1) 0.61 0.12 0.33 0.81 
(3, 1) 0.71 0.10 0.47 0.87 
(3, 2) 0.83 0.06 0.68 0.92 
(4, 1) 0.49 0.15 0.16 0.74 
(4, 2) 0.71 0.10 0.47 0.87 
(4, 3) 0.84 0.07 0.67 0.93 
Intercepts 

Bi 22.69 0.53 21.65 23.75 
Bio 25.31 0.51 24.31 26.31 


Growth Rates 


Boy 0.47 0.10 0.27 0.67 
Boo 0.78 0.10 0.60 0.97 
Predictions 

Predicted Yu,4 27.97 1.26 25.51 30.48 
Predicted Yio, 4 29.71 1.23 27.24 32.13 
MAD 1.24 0.65 0.23 2.70 


2. Structured & unstructured errors 


Intercepts Mean St. devn. 2.5% 97.5% 
Bi 22.67 0.58 21.44 23.76 
Bio 25.27 0.51 24.32 26.39 


Growth Rates 


Boy 0.46 0.12 0.19 0.68 
Bo» 0.77 0.11 0.57 0.97 
Predictions 

Predicted yj), 4 27.77 1.43 24.78 30.44 
Predicted yy» 4 29.91 1.41 27.09 32.69 
MAD 1.29 0.69 0.29 2.97 
Autocorrelation, y 0.77 0.07 0.63 0.91 


3. Autoregression in observations (gender specific) 


Intercepts Mean St. devn. 2.5% 97.5% 
By 2.96 1.65 0.93 7.18 
B2 6.55 2.20 2.57 11.25 
7 21.20 0.72 19.77 22.64 
2 22.87 0.61 21.69 24.10 


(continues) 
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Table 6.12 (continued) 


Autoregressive parameters 


Ppi 0.91 0.07 0.72 1.00 
P2 0.80 0.09 0.60 0.97 
Predictions 

Predicted Yil, 4 28.42 1.32 25.71 30.88 
Predicted Vi, 4 29.72 1.32 27.10 32.22 


MAD 1.30 0.69 0.23 2.87 


as being from a more robust density than the observed y; (e.g. multivariate ¢ with low 
degrees of freedom, but the same mean and dispersion as in the multivariate Normal for 
the observed yj;). 

The mean absolute deviation is used as a discrepancy measure and obtained by 
averaging over the two missing data points. There are relatively slight differences in 
predictive success under these different models (Table 6.12). The growth curve with 
unstructured errors!> gives a posterior mean of 28 for the fourth observation of the last 
female and 29.7 for the fourth observation of the first male, with average MAD of 1.24. 
Allowing autocorrelated errors changes forecasts little, namely 27.8 and 29.9 for the 
missing female and male data points respectively. The growth rate B., is higher for 
males (k = 2) than females (k = 1) under both growth curve models, and the autore- 
gressive parameter in the third model is correspondingly higher for males. The third 
model yields a higher autoregressive parameter for females and slightly higher predic- 
tion of vita. 


6.4.1 Demographic data by age and time period 


Profiles of demographic outcomes by age, marital status, ethnicity, etc. may be observed 
repeatedly over time, and this forms the basis for projecting to future times. The analysis 
of such data often also focuses on smoothing observed schedules of mortality, marriage 
rates, fertility, over the demographic categories, for example to take account of regular- 
ities in the relationship between adjacent rates, especially age specific rates. Hickman 
and Miller (1981) consider in particular the projection of exit rates for populations of 
insured lives or properties, and of annuitants and pensioners. The goal is estimation of 
probabilities of remaining in or leaving these populations because of death, withdrawal, 
migration, etc. These probabilities in turn figure in cash flow projections of insurance 
companies. Similar issues arise in demographic projections, for example in projecting 
population mortality rates. 

Hickman and Miller consider models allowing both for correlations between mortal- 
ity at different ages and for correlations over calendar time (i.e. the usual temporal 
autocorrelation). They present totals of female annuitant deaths Dx: by ages k, namely 
51-55, 56-60, .. 86-90, and 91-95 and for years t, namely 1953, 1958, 1963 and 1968 
(where k = 1,..K and t= 1,.. T, with K = 9 and T = 4). These deaths occur among 
total exposures (total time exposed to risk in the relevant population) Er. They consider 
a Normal likelihood approximation based on square root transforms of the observed 


15 Convergence in modell is obtained at under 500 iterations in two chain runs of 20000, and summaries are 
based on iterations 500-20 000. 
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crude death rates mp, = Dkt / Ekt. The square root transforms vy; = (met are taken to 
be normal around means vk, = (uz? , with variances 1/(4E;;). 

The model means vx are in turn? assumed multinormal with means p, and with 
covariation among errors determined by correlations across both ages and times. The 
correlations between ages i, j = 1,.. K are described by a matrix 


Ay = HL L3’) 


where L; are prior exposures, in effect prior sample sizes based on accumulated actuarial 
experience with such populations, and assumed known. The correlations between times 
s, t= 1,.. T are described by a matrix 


Cyt = pil 


Then the prior covariance matrix of the vector p,;, (of length K.T = 36) is the Kro- 
necker product G = C*A. Hickman and Miller assume both p; and p, to be known, 
though they could be assigned priors, e.g. on the (0, 1) interval. 


Example 6.9 Annuitant deaths A similar structure to that used by Hickman and 
Miller is illustrated here except in assuming a Poisson likelihood with 


log (Bin) =& Ne 


where the n;, have the above described multinormal structure. Autocorrelation in age 
and time separately is also considered, in which 


log (Me) =O Ek Ex (6.27) 


and both errors are governed by lag 1 dependence. 

Hickman and Miller define a structured covariance for the n, determined by only 
three parameters, the overall variance and the two correlations. To illustrate the fully 
parameterised multinormal model, Model A includes a 36 x 36 dispersion matrix for 
the interdependence of age and time errors. This is assigned a Wishart prior with 36 
degrees of freedom. This model has 24.5 effective parameters and a DIC of 58.5, and 
closely reproduces the data with a deviance at the mode of 10. Model B follows Hick- 
man and Miller (1981) in taking a reduced parameterisation for the multinormal with p, 
and p, assumed known (with values 0.9 and 0.5, respectively). These values are based on 
extensive sensitivity analysis by them with the annuitant data. To allow for remaining 
variation in this model a simple unstructured error uj; ~ N(0, o?) is included, so that 


log (Wx) =a Nkr Ukt 


where 1/o? ~ G(1, 0.001). This model gives a slightly improved fit compared to the 
unstructured dispersion model, with DIC of 56. One may set priors for p; and p, (rather 
than assuming them preset), but direct sampling in WINBUGS is very slow. One 
possibility is to precalculate covariance matrices G over a grid of pairs of values of p, 
and p, input them as data, and then use a discrete prior to find the combination 
supported by the data. 

A conditional error model is easier to implement in WINBUGS, as in Equation 
(6.27), with 


16 Hickman and Miller actually consider the transformed outcome (1000p,,)°>. 
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2 
Ein ~ N(P1E1k-1; T1) 
Ez ~ N(Pp2E21-1, 73) 


The posterior means of p; and p, (and 95% credible intervals) are, respectively, 0.88 
(0.83, 0.93) and 0.77 (0.05, 0.99), with the density for p, negatively skewed. The DIC of 
46.5 (with 10.5 effective parameters) improves on the multinormal models. Table 6.14 
presents posterior means of the square root of death rates in relation to populations in 
thousands, namely 


bkt = (er X 1000)°° 


Table 6.15 Posterior means for graduated value of square root of death rates per 1000 
population (second half of run of 20000 iterations over two chains), by, for ages k and years t 


Age Year Deaths Exposed lives Empirical rate Smoothed rate St. devn. 


51-55 1953 0 171 0.00 2.11 0.29 
1958 2 214 3.06 2.02 0.28 
1963 3 328 3.02 2.08 0.28 
1968 3 439 2.61 1.97 0.27 
56-60 1953 15 1371 3.31 2.78 0.16 
1958 8 1874 2.07 2.66 0.15 
1963 20 2879 2.64 219 0.15 
1968 28 3597 2.79 2.59 0.14 
61-65 1953 63 4899 3.59 3.49 0.11 
1958 87 7939 3.31 3.34 0.09 
1963 132 11230 3.43 3.43 0.09 
1968 174 16530 3.24 3.26 0.08 
66-70 1953 111 6596 4.10 4.30 0.11 
1958 235 14463 4.03 4.12 0.09 
1963 430 22500 4.37 4.24 0.08 
1968 529 33360 3.98 4.02 0.06 
71-75 1953 69 2414 5:35 5.59 0.15 
1958 180 6451 5.28 5.35 0.12 
1963 407 13668 5.46 5.50 0.10 
1968 611 22109 5.26 5.21 0.08 
76-80 1953 69 925 8.64 7.89 0.22 
1958 115 2029 7.53 7.56 0.17 
1963 340 5387 7.94 7.76 0.14 
1968 611 11689 7.23 7.36 0.12 
81-85 1953 35 269 11.41 10.30 0.32 
1958 59 631 9.67 9.87 0.26 
1963 158 1448 10.45 10.14 0.24 
1968 351 3941 9.44 9.62 0.20 
86-90 1953 9 62 12.05 13.37 0.52 
1958 23 130 13.30 12.81 0.45 
1963 59 363 12.75 13.15 0.43 
1968 130 802 12.73 12.48 0.39 
91-95 1953 2 10 14.14 16.91 1.02 
1958 7 24 17.08 16.20 0.95 
1963 13 62 14.48 16.64 0.95 


1968 48 149 17.95 15.79 0.88 
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These are reasonably close to those cited by Hickman and Miller (1981, Table 11). Rates 
under this model are smoothed towards the average for the age group, though trends are 
still in line with those in the crude rates. For instance, for ages 76-80, the smoothed rates 
(like the crude ratres) show a deterioration between 1958 and 1963, as compared to 
improvement between the first two years. 

The final model (Model D in Program 6.9) is as just described but estimated using 
observations over three years, and a prediction made for 1968. For these data the 
posterior means of p; and p, (and 95% credible intervals) are, respectively, 0.87 
(0.82, 0.91) and 0.65 (0.04, 0.99). The credible intervals of the predictions for 1968 
include the actual values, except for ages 91-95, which are under-predicted. 

Various generalisations of these models can be developed, either for the multinormal 
(simultaneous autocorrelation) or the conditional error model. For example, one might 
make the age correlation specific to periods, so that 


log (My) =O Eik Ex 
Elkt ~ N(p1,£1, k-1, t> T) 


Random walk autoregressive priors of order 2 are investigated by Berzuini and Clayton 
(1994), so that one might have 


2 
Eix ~ N(2e1, k-1 — £1, k-2, Ty) 


6.5 MISSING DATA IN LONGITUDINAL STUDIES 


A practical feature of many surveys or trials is nonresponse, either unit nonresponse 
with failure to obtain any responses from certain members of the sampled population, 
or item nonresponse to individual questions from unit-level respondents. This raises 
implications of nonresponse bias, for example in connection with sub-populations with 
known nonresponse problems in surveys (e.g. low income minorities). Such problems 
occur in periodic longitudinal surveys where a separate sample of the population is 
chosen each time the survey is carried out. Panel surveys (following up on a cohort of 
subjects not open to recruitment) are additionally subject to an attrition effect as unit 
non-response is cumulative (Winer, 1983). Similarly, in clinical trials or other studies 
with a cohort design, following up treatment outcomes or quality of life measures, 
patients may drop out of the study because of deteriorating health, poor quality of life, 
or death. Such drop-out is ‘informative’, inducing a dependence between the variable 
being measured and the drop-out mechanism. 

Among approaches to this problem are the development of sample weights and 
adjustments for non-response and survey attrition, and the use of imputation methods. 
Explicit models for non-response rely on observed responses: often non-response on one 
or more items occurs for some subjects, but complete response on certain items or 
survey design variables are available for all sample members. Suppose X; is observed 
for all units in a study, but X> is not observed for everyone. Then models or imputation 
methods proceed on assumptions about the mechanism relating X; and X2. The 
assumption of ignorability is that respondents and non-respondents (on X2) with 
the same X; differ only randomly on X2. A non-ignorable model by contrast allows 
for the possibility of systematic differences in X> for a responder and a non-responder 
having the same value of X4. 
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Let R = {Rj} indicate whether response was made (Rj = 0) or missing (Ry; = 1) for 
subjects i= 1,..N at times t= 1,.. T and items j = 1,..k, and that X denotes covari- 
ates fully measured for all respondents (e.g. survey design variables such as geographic 
area of sampling). The most general form of joint distribution of the indicators R and 
the outcomes, known and missing, Y = { Yobs, Ymis}, has the form 


S(R, Y|9r, Oy, X) =fy(V|X, 0y) ARIY, X, OR) 


where f, is the sampling density of the data and f, is density for the response mechanism. 
This form allows the response mechanism to be influenced by the outcomes Y, whether 
observed or missing. 

A less general assumption is that of ‘missingness at random’ if 

SARI Yoos, Ymis, X, OR) = f(R| Yobs, X, OR) 
In this case, non-response can depend upon the observed outcomes on other items, or 
on observed responses by other subjects, but given these, it will not depend upon 
the missing item responses themselves. If the MAR assumption holds and models 
for the outcome and response are separate (i.e. Oy and Og are non-overlapping), then 
the missingness pattern is called ignorable (Rubin, 1976). So missingness on an item is 
non-ignorable if it depends upon the missing value of that outcome. 

In panel studies, a particular definition of non-response occurs where drop-out 
(without subsequent re-entry to observation) is selective. In the selection model (Diggle 
and Kenward, 1994; Heckman, 1976), the joint density of Y and R is expressed as above 
in terms of the marginal density of Y and the conditional density of R given Y. Drop- 
outs at time ¢ are then random if f(R,|Y) =/(R,|%,... ¥;-1), but if related to the 
current outcome, known or otherwise, are informative. In practice, assuming random 
drop-out means relating Pr( Rj; = 1) to previous binary outcomes Yj1,.. Yj, 1, but not 
to the current outcome itself. An alternative pattern mixture model, proposed by Little 
(1993), involves expressing the joint density of Y and R as the marginal density of R and 
the conditional density of Y given R, namely 


F(R, Y) = fi(R) HOIR) 


Example 6.10 Schizophrenia treatments To illustrate modelling informative as against 
random drop-out, this example considers patient withdrawals in a longitudinal trial 
comparing treatments for schizophrenia (Diggle, 1998), where the issue is permanent 
attrition by a large number of patients (as against intermittent missing values). 517 
patients were randomly allocated either to a control (placebo) group or to treatments 
involving one of two drugs, with different dosages on one : haloperidol at 20mg and 
risperidone at 2, 6, 10 and 16mg. There are therefore six possible regimes, including 
the placebo. Measures were obtained at seven time points (at selection into trial, at 
baseline, and at weeks 1, 2, 4, 6 and 8 thereafter). Let v, denote the number of weeks at 
these time points, with the baseline defined by v = 0, and selection for the trial 
happening at vı = —1. The outcome measure was the PANSS scale (Positive and Nega- 
tive Symptom Scale), a continuous measure of psychiatric illness, with higher scores 
denoting more severe illness. Let G; denote the treatment group of patients i = 1,..N. All 
patients on risperidone are combined into one category, so the groups are then | = halo- 
peridol, 2 = placebo, and 3 = risperidone. 


The cumulative level of attrition is only 0.6% at the second round of observation 
(when v = 0), but reaches 1.7%, 13.5%, 23.6% and 39.7% in successive waves, and 
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stands at 48.5% in the final wave. The drop out rate peaks therefore at the sixth week. 
The question is whether attrition is related to health status: if there is a positive impact 
of PANSS scores on the probability of drop out, then the observed time paths of 
PANSS scores may be distorted. The observed means (not including the missing scores 
due to drop out) are, respectively, 87.2, 92.5, 86.1, 80.9, 78.5, 76.1 and 73.2. It may be 
of interest to estimate the wave means when the informative nature of the non-response 
mechanism is allowed for, and wave means include estimates of the scores for the 
drop outs. 

Suppose the Rj, are defined as above, where there is just one item to consider, 
so we can write this indicator as R. Response at time s will be ignorable if it 
depends upon observed covariates or previous observed values of PANSS score Y; 
(t= 1,..s— 1), but not if it depends on the current, possibly missing, score Yj;. There 
are two options on the definition of the likelihood. One may either consider all missing 
data points, so that for a drop out at the fourth wave, the response model likelihood 
would include the data R;, = 1 for t = 4,..,7 and the outcome model would include the 
missing data Y;,, for t>4. Alternatively, one might limit the analysis to the first 
instance when R; = 1, so the likelihood extends to T; 1 for subjects who drop out 
at occasion T; (when T; < 7) and is T; = 7 for subjects who stay under observation 
throughout. 

The data model for patient i is taken to be normal with 


Yn ~ Nui, 1/7) t= 1,7 (6.28) 


with means defined by treatment specific linear terms in time (in weeks v), by treatment 
effects 8; and by random terms as follows: 


Bix = Mi ðc Gv YG; Ui eit (6.29) 


with 8; = 0 for identifiability. The permanent subject effects U; have mean zero and 
variance 1/t2. A quadratic in weeks is included, since the PANSS readings tend to rise 
between selection (week —1) and baseline (week 0) before falling thereafter. 

Apart from unstructured measurement errors, implicit in Equation (6.28), Diggle 
assumes an autoregressive dependence in the errors e;;, with covariance between e; and 
eis Specified as 

Cov(e) = 026°? 


where A = v; — v;. Here an alternative form is adopted, though with similar allowance 
for difference time spacings between readings. Thus, the e; in the first time period are 
taken as Normal 


ei ~ N(O, 1/73) 
while subsequent errors have the form 
ein ~ N(Enr, 1/ta) t= 2, T 
The means of the autoregressive errors for t > 1 are 
Ep = phe, 1 
The model for response indicators is 


Ri ~ Bern(tit) 
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with the dependence on the outcome Y in the model for m determining whether drop-out 
is random or informative. For example, a non-informative model (when the likelihood 
includes only the first non-response wave) is 
logit(t;.;)=c) &Y; 1 t>1 
gitl ) x (6.30) 
logit(m,1)=c1 ©Yi,o 
where the unknown Bernoulli outcomes Y; o are modelled in terms of an extra un- 
known ‘response rate’ at time 0, mo. By contrast, an informative model is 


logit(m, )= 61 cY; (6.31) 


since this may refer to missing scores Y. Either type of model might also include time 
itself as a predictor of missingness, since there is clearly a trend to increased attrition at 
later waves. 

With the non-informative model for response (6.30), similar results to those cited by 
Diggle (1998, p. 221) are obtained!” for the coefficients of the response model. Drop-out 
increases with PANSS score so that those remaining are increasingly ‘healthier’ than the 
true average. 

The linear time terms 01, 62 and 63 and the treatment effects both support the 
effectiveness of the new drug, though its main effect ò; in Equation (6.29) is not as clearly 
defined as the decline in PANSS scores under this treatment, shown by 03 (Table 6.16). 


Table 6.16 PANSS scores over time, non-informative drop-out, 
parameters for response and observation model 


Mean St. devn. 2.5% 97.5% 
Response model 


Cy —5.32 0.27 —5.84 —4.81 
C2 0.034 0.003 0.028 0.039 


Observation model 


Treatment 
ò 1.43 1.76 —1.95 5.02 
33 —1.72 1.40 —4.29 1.49 
Error correlation 
p 0.938 0.015 0.908 0.967 
Linear time 
ði 0.04 0.73 —1.29 1.48 
05 1.03 0.61 —0.11 2.25 
03 —1.86 0.33 —2.40 —1.16 
Squared time 
yi —0.066 0.099 —0.258 0.117 
Yo —0.066 0.082 —0.232 0.095 


y3 0.104 0.041 0.023 0.175 


17 A two chain run shows convergence at 1000 iterations and the summary is based on iterations 1000-5000. 
For the informative response model convergence in a 5000 iteration run is obtained at 2500 iterations. 
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The placebo group apparently experience a deterioration in outcome over time (82 > 0). 
A high correlation over time in errors of the regression model for the observed scores is 
apparent with posterior mean for p of 0.94. 

Fit may be assessed by the predictive criterion in footnote 1, where for missing Y; 
only the precision of prediction is relevant; the criterion comparing actual and predicted 
R is 328, and that comparing predicted and actual Y is 277 100. 

Introducing the current PANSS score Yy into the model for response R; makes 
the drop-out model informative, as in Equation (6.31). This response model is applied 
with the likelihood just including the first wave of non-response for those who drop 
out!®. The fit to the scores Y;, improves slightly (predictive criterion 270000), but the 
response model fit is unchanged (Table 6.17). In terms of inferences on treatment 
effectiveness, both the main treatment effect and the linear time effect for risperidone 
treatment are now significantly negative, though the time slope is less acute than under 
non-informative response. 


6.6 REVIEW 


The analysis of panel data by fully Bayesian techniques facilitates a more flexible and 
robust approach, for example in terms of assumptions about: 


Table 6.17 PANSS scores over time, informative drop-out, 
parameters for response and observation model 


Mean St. devn. 2.5% 97.5% 


Response model 


Cı —5.58 0.38 —6.28 —4.80 
C2 0.034 0.004 0.026 0.041 


Observation model 


Treatment 

ò 3.25 2.38 —1.16 8.07 
53 —4.70 2.20 —9.60 —0.15 
Correlation 

p 0.960 0.015 0.930 0.988 
Linear time 

0; 0.06 0.33 —0.61 0.68 
05 1.15 0.37 0.38 1.88 
03 —0.81 0.19 —1.16 —0.42 
Squared time 

yı —0.017 0.087 —0.186 0.158 
Yo 0.046 0.095 —0.135 0.240 
y3 0.108 0.045 0.023 0.199 


18 One might define the likelihood over all seven waves, but this requires an informative prior on c2 as for 
some subjects several waves of non-response are then involved. 
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e the nature of autoregressive dependence in error terms, allowing for an assessment 
of the stationarity assumption rather than necessarily assuming it; 

e the parametric structure (or otherwise) of error terms, both time varying and fixed 
subject errors (e.g. Hirano, 2002) 

e data augmentation to sample initial pre-series values, such as in an AR(2) model in the 
response referring implicitly to unobserved data at times 0 and —1 (Karlsson, 2002); 

e data augmentation to impute missing data values depending on an additional panel 
type model for the missingness mechanism in case the latter is informative (see 
Section 6.5). 


There are a number of techniques so far generally tackled by maximum likelihood or 
EM methods where the fully Bayesian method might be used to good effect: examples 
are INAR models for Poisson and multinomial data (Bockenholt, 1999a, 1999b) or the 
Poisson exponentially weighted moving average model (Brandt et al., 2000). Chapter 5 
illustrated how INAR models might be implemented in Bayes terms and extension to 
panel situations is relatively straightfoward. 
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EXERCISES 


1. In Example 6.1, try a bivariate Student f via a scale mixture (see Program C). How 
does the predictive loss criterion change, and which patient has the lowest weight? 
Try also setting the degrees of freedom (df) in the Bivariate Student to be an extra 
parameter. 


2. In Example 6.2, apply the DPP model, where the baseline prior has a single variance 
term, namely 6; = ọ in Equation (6.15). How does fit compare with the model 
assuming component specific variances? 


3. In Example 6.3, investigate whether an improved fit results from making all the lag 
coefficients random (with all mean lag coefficients also free parameters). 


4. In Example 6.5, try replicating the model of Hedeker and Gibbons (1994) which 
reduces the ordinal outcome to four categories: I (normal or borderline, combining 
categories 1 and 2 of the 7 category outcome); II ( mildly or moderately ill, combin- 
ing categories 3 and 4 of the 7 category outcome); III (markedly ill, category 5 of the 
7 category outcome); IV (severely or extremely ill, combining categories 6 and 7 of 
the 7 category outcome). How much does fit deteriorate through making this 
simplification? 


5. In Example 6.5, try a scale mixture version of the random effects model, with degrees 
of freedom an unknown, and with the full ordinal scale. The analysis may be sensitive 
to the prior adopted for the degrees of freedom (e.g. a uniform between 4 and 100, a 
discrete prior with a grid of values over the same range). Are there any clear outliers 
(scale factors clearly under one)? 


6. In Example 6.6, try the Albert-Chib model with Normal sampling of the Zy, i.e. 
equivalent to a probit link. Does this affect fit or other inferences as compared to the 
Student ¢(8) model? Also, try both the Student t and Normal sampling models when 
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the lag is in the Zy themselves rather than the error. How does this compare in fit to 
the autocorrelated error model? 


7. In Example 6.8, try prediction of the missing dental growth curve data using AR(1) 
error autocorrelation parameters specific to gender. Does this improve predictive 
accuracy? 


8. In Example 6.8, try estimating the alternative autoregression model suggested above 
involving a multivariate t to model the latent yj. 
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CHAPTER 7 


Models for Spatial Outcomes 
and Geographical Association 


7.1 INTRODUCTION 


The analysis of spatial data has involved recent advances in several fields, to some 
extent referring to a central core of knowledge, but showing many distinct features in 
the specialisms involved. Thus, many Bayesian applications have occurred in spatial 
epidemiology, with Lawson (2001), Lawson et al. (1999) and Elliott et al. (2000) 
providing state-of-the-art discussion. One may say that here a major element is the 
assessment of patterns of relative disease risk in terms of possible clustering, perhaps 
around environmental point sources, but also in terms of ecological regression of 
disease patterns in terms of known risk factors (Lawson, 2001, p. 5). A more long- 
standing tradition of spatial modelling has occurred in spatial econometrics, with 
Anselin (2001) and Anselin and Florax (2002) providing recent overviews, and 
with Lesage (1999) providing a review of Bayesian principles in this area. Here the 
major emphasis lies in describing behavioural relationships by regression models, 
whether the data are defined over regions and areas, or at the level of individual actors 
(house purchasers, firms, etc.) involved in spatially defined behaviours. A third major 
specialism occurs in geostatistics, where a continuous spatial framework is adopted and 
the goal is often to smooth or interpolate between observed readings (e.g. of mineral 
concentrations) at sampled locations. Providing a common thread is a central core of 
spatial statistics, exemplified by the works of Ripley (1981) and Cressie (1993). 

One by-product of developments proceeding in sometimes disparate areas is that certain 
terms (e.g. heterogeneity) may be defined differently according to field of application. On 
the other hand, there is also often considerable benefit in applying concepts across these 
areas, for example using geostatistical ideas in ecological regression or in examining 
spatial disease patterns (Oliver et al., 1992). The present chapter seeks to provide a 
selective introduction to some of the modelling issues involved in these diverse areas; it 
is inevitably selective and partial, but hopes to identify some underlying common themes. 

The contrasting concerns of the major specialisms may be illustrated by a distinction 
sometimes drawn between models focused on spatial interactions, and those concerned 
with spatial disturbances, meaning regression errors (Doreian, 1982). In spatial effect or 
spatial interaction models, the spatial pattern or space-time pattern is the main focus of 
the analysis, for instance when the realised patterns reflect causal processes. Thus 
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investigations into ‘focused’ clustering of excess mortality or illness around pollution 
point sources typically involve a hypothesis that exposure increases nearer to the source 
resulting in raised disease incidence (Section 7.6). Similarly, in spatial econometrics, 
spatial autoregressive models in the observations themselves are commonplace (e.g. see 
Anselin, 2001), for instance reflecting behavioural processes of spatial diffusion. 

However, in many regression applications in spatial epidemiology involving aggre- 
gate disease counts in areas, spatial interaction models are relevant only for certain 
outcomes, such as infectious diseases (Cliff and Ord, 1981). In these applications spatial 
dependence’ occurs because of omitted or unmeasured spatially correlated predictors, 
and so is reflected in regression errors, causing departures from the independent errors 
assumption of the conventional regression. In problems involving both space and time 
dimensions (Section 7.7), errors may be correlated in both time and space simultan- 
eously (Lagazio et al., 2001). An example of incorrect inferences when the outcomes or 
errors in a regression are spatially correlated is that such correlation reduces the amount 
of independent evidence available to model the process under investigation, and may, if 
not allowed for, lead to over-estimation of the significance of regression relationships 
(Richardson and Monfort, 2000, p. 211). 

An additional issue raised clearly by writers such as Fotheringham et al. (2000) and 
Lesage (1999) is that of spatial heterogeneity’, either in regression relationships (e.g. in 
terms of regression coefficients varying over space) or in terms of heteroscedasticity in a 
spatially unstructured error term. There may be identifiability problems in separating 
spatial dependence (e.g. correlation) from spatial heterogeneity (de Graaff et al., 2001; 
Anselin, 2001). 

Consider how one might allow for spatial correlation in regression errors (though 
heterogeneity is considered in Section 7.5). Whereas the standard linear model for 
metric outcomes would assume independence in the errors of e; and e;, with i Æj 
being different areas, alternatively suppose the errors of area i are not necessarily 
independent of the errors in other areas. Thus, let Y and e be n x 1, x be n x p and 
assume a linear regression 


Y=xß e (7.la) 


with a joint prior specifying multivariate dependence between the errors, such as an 
MVN model with 


e ~ N, (0, È) (7.1b) 


where the off-diagonal elements in the n x n covariance matrix È reflect the spatial 
patterns in the data. A range of techniques (e.g. variogram analysis) explore covariation 
in the outcomes or in regression residuals, in relation to inter-area or inter-point 
distances, so as to estimate parameters in the covariance matrix (see Section 7.4). For 
example, Cook and Pocock (1983) use a form of variogram analysis to define the 
covariance matrix 5 in terms of an overall prior variance o°, when dj; = 0 and a 
parameter y reflecting off-diagonal spatial dependence between areas: 


' Following Anselin (2001), spatial dependence may be considered as any departure from the independent 
errors model or from models excluding spatial lags in outcomes; thus in spatially dependent models, one may 
have cov(Y;, Y;) 4 0 for i and j neighbours in some sense, and/or cov(e;, e;) # 0. The existing spatial literature 
is dominated by linear models and linear spatial correlation, whereas the broader concept of dependence might 
include non-linear spatial errors or interactions (de Graaff et al., 2001). 

? In disease mapping applications, the term excess heterogeneity is often applied to spatially unstructured 
errors (for Poisson overdispersion) in the log link for count outcomes. 
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2 
Ly ae (712) 

rj =o exp (— ydj) 
where R = [rj] are correlations. The distance decay parameter y is expected to be 
positive so that outcomes or errors in neighbouring areas are more similar. Such “direct 
modelling’ of the covariance structure is also relevant for spatial interpolation in fields 
such as geostatistics, where readings of an outcome Y(s;) may be obtained in two or 
three dimensional space s; = (5};, 52;,.. ) and the model for covariation between loca- 
tions s; and s; is used to predict Y at an unobserved location. 

Alternatively, one may seek a parameterisation of > in terms of a known interaction 
scheme (Bailey and Gattrell, 1995), which may take account of contiguities between 
areas or distances dj between them (Section 7.2). This approach results in simplified 
identifiability of other aspects of the model (Anselin, 2001). The extent of spatial 
dependence (typically of a linear nature) in the interactions between Y or in the 
regression errors is modelled in terms of spatial autocorrelation between adjacent 
areas, usually expected to be positive. Spatial autoregressive models then include one 
or more unknown autocorrelation parameters p, applying either to spatial lags in the 
outcomes themselves or to the regression errors. In mapping of disease counts 
(Section 7.3), a particular modelling structure assumes a known interaction scheme, 
and also that p = 1 in the errors, and focuses on the relative variability of spatial and 
unstructured errors (e.g. Best et al., 1999; Mollié, 2000). 


7.2 SPATIAL REGRESSIONS FOR CONTINUOUS DATA WITH FIXED 
INTERACTION SCHEMES 


Here we first consider regression models where the interaction between areas is taken as 
known and the focus is on making correct inferences about regression impacts, on 
estimating different forms of spatial correlation (e.g. in the data themselves or in the 
regression errors), or on allowing for spatial heterogeneity. We first focus on 
the observed continuous outcome case, though it may be noted that the ideas transfer 
to latent continuous variable models (e.g. when outcomes are binary or ordinal), using, 
for instance, the sampling methods of Albert and Chib (1993). 

Thus, consider an n x n matrix C of contiguity dummies, with cj = 1 if areas i and j 
are adjacent, and cj = 0 otherwise (with c; = 0). Alternatively, a distance based inter- 
action scheme might involve elements such as cj = 1/dj(i #7) or cj = 1 / de, but again 
with c; = 0. Then scale the elements to sum to unity in rows, with W as the scaled 
matrix, 


W = [wy] = [ey/ X. cal 
J 


Given the known form of C and hence W, a model for spatial dependence in the errors 
for a metric outcome Y might then take the form 
Y=x e 
f (7.3a) 
e=pWe u 


where p is an unknown correlation parameter, where Y, e and u are vectors of length n, 
and x is of dimension n x p. Here the u denote spatially unstructured errors, which are 
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frequently taken as homoscedastic u; ~ N(0, o7). If the interactions are scaled within 
rows, then the permissible maximum of p is 1 (Anselin, 2001; Bailey and Gattrell, 1995, 
Chapter 7), and the permissible minimum is the smallest eigenvalue of W, which is 
greater than —1 but less than 0. Since spatial correlation is positive in the great majority 
of econometric or health applications, a prior on p constrained to [0, 1] is feasible in 
many applications. 


7.2.1 Joint vs. conditional priors 


Letting Q = I — pW, the precision matrix £~! of the errors e in Equation (7.3a) may be 
derived as 


S =10'0 (7.3b) 


where t = o7? (Richardson et al., 1992). A full multinormal scheme for the e could be 
used, with the errors sampled simultaneously from their joint prior 


e ~ N,(0, 2) (7.3c) 


However, a conditional scheme is possible, and may be simpler to sample from (e.g. in 
WINBUGS). As mentioned by Wakefield et al. (2000), modelling of spatially correlated 
errors may proceed by initially specifying either the joint multivariate distribution of the 
vector e, or the univariate density of each areas error, e;, conditional on the current 
estimate of errors in other areas (e;, j 4 i). Conditions that ensure the joint density is 
proper (so that the e; are identifiable) when the model specification starts with a 
conditional rather than the joint prior? are discussed by Wakefield et al. (2000) and 
Besag and Kooperberg (1995). 

One possible conditional prior (the conditional autoregressive or CAR prior) 
expresses e; in Equations (7.3) in the centred univariate Normal form 


ei ~ N(Mj, 0°) (7.4a) 


where the mean of each area’s error 


M; =p yo Cie; (7.4b) 
j 


is a weighted average of errors in contiguous areas, and p is bounded by the inverses of 
the minimum and maximum eigenvalues of C. (Note that the interaction matrix for this 
form of prior needs to be symmetric.) The covariance of the vector e in the correspond- 
ing joint prior is then X = ø? (I — C)! (Richardson, 1992; Wakefield et al., 2000). 
One may also have a ‘spatial effects’ or ‘spatial autoregression’ model, with spatial 
lags in the outcomes themselves (e.g. Ord, 1975; Anselin, 2001), with 
Y=xB pWY u (7.5) 


where u is white noise. Spatial dependence in both Y and e may occur in the same model, 
for example: 


Y=xp p,WY e 
e=p,We u 


> The identifiability issue with the ICAR(1) model is discussed in Section 7.2.1. 
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where e is multinormal as in Equations (7.3). In some situations, a spatial lag in the 
outcome might not be substantively sensible. If it were, then a sensitivity analysis might 
consider both correlation in both spatial effects and spatial errors, and also consider 
non-constant variances in the u; (Lesage, 2002); it might also encompass several forms 
of spatial interaction matrix, even if these are taken as fixed within the modelling 
(Richardson and Monfort, 2000; Anselin, 2001). 

It may be noted that the multinormal spatial disturbances model (7.3a) may be 
expressed as 


Y —pWy = xR —-pWxB u (7.6) 


namely as a regression of the ‘filtered’ outcome Y* = Y — pWy on the filtered predict- 
ors x* = x — pWx. So the pure spatial effects model (7.5) is in fact a particular case of 
(7.6). 

For continuous outcomes, it is in fact simpler in WINBUGS to use this filtered 
outcome and predictor approach*. Thus, in Example 7.1, the means for a metric 
outcome Y; are p,;=pWy xB — pWxB. With count or binomial outcomes, by con- 
trast, one might have an error directly in the form (7.3) or (7.4) in the regression link. 
Thus, if Y; ~ Poi(w,), one might take 


log(w;) = xB ei 


ei ~ N(M;, o°) 


where M; is as in (7.4b).Example 7.3 accordingly contains a CAR prior analysis of the 
Scottish lip cancer data. 

Whatever the form of distribution relevant to the outcome, there are possible issues of 
identifiability involved’, since for instance clustering of regression errors may be pro- 
duced by a form of spatial heterogeneity, in that error variances differ between subsets 
of regions. Suppose, following Anselin (2001), spatial heterogeneity is taken to refer 
either to non-constant error variances (heteroscedasticity) or to non-constant regression 
coefficients across space, sometimes known as ‘spatial regimes’ (Anselin, 1990). Then in 
a single cross-section, spatial autocorrelation and spatial heteroscadisticity are observa- 
tionally equivalent. Thus, a clumping of positive residuals in a set of neighbouring areas 
might reflect either mechanism. 

Spatial heteroscedasticity may be parameterised in various ways: either one may 
suppose all areas to have distinct variances, or there may be groups of areas S, with 
o? = o? if i € S,. Lesage (2000) proposes scale mixtures (with each area having its own 
variance) to robustify inferences against outlier data points. This applies even after 


4 This is because for continuous data following a Normal or Student t density, a centred form is adopted in 
WINBUGS which assumes unstructured errors. 

> Highly parameterised but less well identified models (e.g. combining both spatial heterogeneity and cluster- 
ing) can of course be rejected by model fit criteria which penalise complex models which do not add greatly to 
accurate prediction. This will be illustrated by loss of precision in fitted means or predictions of new data. One 
approach to such predictive fit is that of Gelfand and Ghosh (1998), Sahu et al. (1997) and Ibrahim et al. 
(2001). Let f; be the observed frequencies, þ be the parameters, and z; be ‘new’ data sampled from f(z|ẹ). 
Suppose v; and ç; are the posterior mean and variance of z;, then one possible criterion for any w > 0 is 


Das os [w/w Ido Si 
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spatial autocorrelation is explicitly modelled, and so would relate to the density as- 
sumed for the u;. Thus, a scale mixture on the o in (7.3) or (7.4) is equivalent to 
assuming Student ¢ rather than normal u;. Such a mixture is illustrated in Example 7.2 
(involving the Columbus Crime dataset). 


7.3 SPATIAL EFFECTS FOR DISCRETE OUTCOMES: ECOLOGICAL 
ANALYSIS INVOLVING COUNT DATA 


Within the last decade, models for spatial dependence in discrete data (e.g. count or 
binary) have seen a major development. Spatial dependence figures strongly in the 
analysis of disease maps, where event counts rather than metric outcomes are the 
usual focus and where much recent conceptual and methodological development has a 
Bayesian orientation (e.g. Besag et al., 1991). In spatial econometrics, by contrast, more 
attention has focused on the spatial probit model for binary outcomes (e.g. Lesage, 
2000). 

In epidemiological analysis of small area disease data, the main object is often to 
estimate the true pattern of relative risk in the face of overdispersion in the observed 
event counts and spatially correlated errors due to omitted predictors. Estimation of 
relative risks by conventional methods based on the Poisson density (e.g. by standard 
mortality ratios defined as ratios of observed to expected events) assumes that the 
disease or mortality risk is constant over areas and over individuals within areas. In 
fact, individual risks may vary within areas, and risks vary between areas, so that 
observed event counts show a greater variability than the Poisson stipulates. This 
variation can be modelled by expressing area relative risks in terms of one or more 
random effects. 

Some such effects may be spatially unstructured (analogous to white noise in time 
series), and these are sometimes denoted as “excess heterogeneity’ (e.g. Best et al., 1999, 
p. 132). However, overdispersion may also occur due to spatially correlated effects. As 
mentioned above, spatial epidemiology often focuses on clustering of adverse health 
events due to socio-economic or environmental factors; observations on the latter are 
frequently not available or at least, not at sufficiently low spatial resolation. The prior 
model for spatially clustered sources of overdispersion may be seen as proxying unob- 
served risk factors (e.g. environmental or cultural) which are themselves spatially 
correlated (Best, 1999). 

For example, suppose a count of deaths D; from a certain cause is observed in 
a set of small areas, and that expected deaths E; (in the demographic sense) are 
derived from some standard schedule of death rates from the cause concerned. The 
outcomes may, subject to the necessity to take account of overdispersion, be taken as 
Poisson, 


where 0; is the relative risk of mortality in area i. Poisson sampling may be justified by 
considering binomial sampling of deaths by age Dj, in relation to populations by age Pi 
with death rates m, and by assuming relative risks and age rates are proportional, 
namely m; = 0,7; (Wakefield et al., 2000). 

Consider the Poisson sampling model as it stands, with the 0; initially taken as fixed 
effects. Then with flat priors on them, the estimated 0; will be equal to the maximum 
likelihood estimates of SMRs, namely 6; = D; /E;. However, these may be unreliable as 
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estimates of relative risk (e.g. see Bernardinelli and Montomoli, 1992), since maps of 
SMRs are subject to distortion through low event counts or populations at risk, and 
small changes in event totals may produce major shifts in the SMRs. In devising models 
to ‘pool strength’ and reduce such anomalies, one may envisage the total variability in 
the observed rates or SMRs having two components: within area sampling variation 
around the true underlying rate, and between area variations in the true rates, which are 
likely to show spatial correlation to some degree. 

To allow for unstructured ‘white noise’ variability about the true rates, one might 
take the log of the risk 6; to consist of an overall average and a white noise error u;. The 
model is then 


Di~ Poi(£;6;) (7.8a) 
log(@) =y ui (7.8b) 
u; ~ N(O, 02) (7.8c) 


where y is an intercept representing the log of the average relative risk in the areas 
compared to the standard population. However, there may in addition to spatially 
unstructured variability be spatially correlated errors e;, so that (7.8b) becomes 


log (0;) = y Ui ej (7.9) 


This is sometimes known as the ‘convolution’ model. 


7.3.1 Alternative spatial priors in disease models 


There are several alternative specifications for the e;. One option suggested for disease 
mapping and ecological regression (Leyland et al., 2000) assumes® that underlying the 
spatial errors e are unstructured errors v. Under this approach joint densities may then 
be readily specified for u and v. So 


en = So wiv (7.10) 
j=l 


where the wy are row standardised interactions as above. If the wy are based on 
contiguity, then w; = 1/N; if areas i and j are adjacent, with N; being the number of 
neighbours of area i. In this case of contiguity interactions, e; = N Dar, v;, with L; 
denoting the neighbourhood of areas adjacent to i. 

In the convolution model (7.9), one might take u; and v; to be bivariate Normal with 
covariation o,,, but this modelling approach extends readily to multivariate error 
forms, for outcomes j = 1,.. J. For instance, an MVN prior of dimension 2J allows 
correlation between outcome specific errors uj and vj, and so expresses interdependence 
(in regression errors) between the outcomes (Congdon, 2002a). This is illustrated below 
with data for Glasgow postcodes and counts from two causes of death. 

As for continuous outcomes, full joint prior specifications for the spatial errors e; may 
be proposed or conditional priors specified ab initio. One possible joint density specifi- 
cation (Besag et al., 1991) often used in health applications is in terms of pairwise 
differences in errors, so that 


© This type of error structure may also be adopted for continuous outcomes (e.g. Kelejian and Robinson, 
1993), but a particular recent application is to disease counts. 
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He) exp| 056°? Ye, o (7.11a) 
i<j 


The corresponding conditional form (Wakefield et al., 2000, Equation (7.15); Bernardi- 
nelli et al., 1999, Equation (26.4)) is 


P(eile;, j Fi) ~ N(M;, 07) (7.11b) 


with 


~ `~ ~ 
Mi; = > Cjej/ > Ci = > Wie; 


J#i J#Ai J#i 
and c; being spatial interactions as above. The variances differ by area with 
CHK y Se Cij 
Ji 

This scheme, known as an intrinsic conditional autoregression or ICAR(1) prior (since 
the correlation p is set by default to 1), is improper, since it specifies only differences in 
log relative risks, not levels. 

Suppose the cy form a contiguity matrix with cj = 1 if areas i and j are adjacent and 
cij = 0 otherwise. In this case, the ICAR(1) model for the e; has parameters 

M; = ĉi 

o? = k?°/N; 


L 


(7.11c) 


where g; is the average of the e; in the locality L; of area i (with the average excluding e; 
itself), and N; is the number of neighbouring areas in L;. Note that Student ¢ errors e; 
may be achieved by a scale mixture on K? in Equation (7.11c), that is k? = \;k”, where 
A; ~ G(0.5v, 0.5v) and v is the degrees of freedom in the Student ¢ density. 

Other pairwise difference priors are possible. Besag et al. (1991) mention a double 
exponential (Laplace) prior 


pe) xxexp|-05x 9-6-6 

i<j 

which, like the Student ¢, is more robust to outliers or discontinuities in the risk surface. 
x is a scaling parameter, with smaller values implying smaller spatial variability. 

The fact that Equation (7.11a) is improper may lead to problems in convergence and 
identifiability in Bayesian estimation based on repeated sampling. One way of producing 
identifiability is to omit the constant (such as y in Equation (7.9)) so that the average of 
the e; defines the level. Another is to constrain the e; to sum to zero, which in practice 
involves centreing at each iteration in an MCMC run (Ghosh et al., 1998). This identifia- 
bility option for the ICAR(1) model with e normal (i.e. k? constant over areas) is 
implemented in WINBUGS13 (and subsequent versions) as the ‘carnormal’ density. 
The Laplace pairwise difference prior is available in WINBUGS as the ‘carl1’ density. 

Another strategy involves a model redefinition. Thus, following Sun et al. (1999, 
2000), propriety of the posterior is obtained by explicitly introducing a spatial correl- 
ation parameter p absolutely less than 1. So for contiguity interactions, M; = pé; and 


Pleile;, j # i) ~ N(pe;, K/N) (7.12) 
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(compare Sun et al., 2000, Equation (2)). Whereas prior (7.11) is then an ICAR(1), the 
more general approach is denoted an ICAR(p) prior. 

Consider the ‘convolution’ model of Equation (7.9), where e; is based on the pairwise 
differences prior (7.11a) and an ICAR(1) model is assumed. To assess the relative 
strength of spatial and unstructured variation in this model requires estimates of 
marginal rather than conditional variances, and so one may form ‘moment estimates’ 
of the marginal variances, such as 


ô’ =S (e-e /n-— 1) 


This may be compared with the posterior estimate of the marginal variance of the u;, or 
with a comparable moment estimator 


6, =X (u -u/n 1) 


There may be identifiability problems in the convolution model, especially if flat priors 
are taken on the overall variances (precisions) of both the u; and e;. The posterior results 
are dependent on the prior specifications regarding the balance of dispersions on the 
two components, especially if, say, the precision of the unstructured errors was specified 
to be relatively high. Based on a simulation analysis, Bernardinelli et al. (1995) recom- 
mend that var(u;) = 0.7 var(e;), where var(e;) is the conditional variance K?; this is 
known as the ‘balanced dispersion’ prior. More generally, priors for the variances of 
e; and u; might reflect their interdependence, e.g. by taking a prior on the ratio 
var(e)/var(u) on the ratio var(e)/[var(e) var(u)] (see Example 1.3). 


7.3.2 Models recognising discontinuities 


It may be noted that doubts remain about the ability of the convolution specification 
to reproduce discontinuities in disease maps (e.g. a low mortality area surrounded 
by high mortality areas will have a smoothed rate distorted by a spatially correlated 
error model). Forms of discrete mixture have been proposed as more appropriate to 
modelling discontinuities in high disease risk (Militino et al., 2001). For example, 
Knorr-Held and Rasser (2000) propose a scheme whereby at each iteration of an 
MCMC run, areas are allocated to clusters. These are defined by cluster centres and 
surrounding contiguous areas, and have identical risk within each of them. Clusters may 
be redefined at each iteration. The estimated relative risk for each area, averaged over 
all iterations, is then a form of non-parametric estimator, and may better reflect 
discontinuities. 

Lawson and Clark (2002) propose a mixture of the ICAR(1) and Laplace priors, with 
the mixture defined by a continuous (beta) density rather than a discrete mixture. So 
Equation (7.9) becomes 


log@)=y u ne U-n)f 
where f; follows the conditional Laplace form, and one might take the beta prior on 
the n; to have fixed parameters, for instance n; ~ Beta(1, 1) or n; ~ Beta(0.5, 0.5). 
Options on such a scheme include setting priors on the hyperparameters {a, b} in 
the beta mixture ny; ~ Beta(a, b), or simpler mixture forms, such as just putting 
more emphasis on the unstructured component in some ‘discontinuous’ areas: 


log@)=y nw U-n))ei 
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7.3.3 Binary outcomes 
For binary data defined over areas i = 1, .. n 
Y; ~ Bern(m;) 

one might model spatial and nonspatial error effects via the logit or probit links on 7;. 
For binary data the variance is not identifiable, so it is preset, usually to 1. One might, 
for instance, specify a model with a CAR prior on a spatial error, so that 

logit(7;) = xB ej 
where 


ei ~ N(M;, 1) 


and where the mean of each area’s error is 
Mi =p y Cijej 
J 


An ICAR(1) or ICAR(p) model could also be specified for the e; in this model. If 
both unstructured and structured errors (u; and e;) are envisaged, then their respec- 
tive variances would be shares of an overall known quantity, for example 
var(u) var(e) = 1. 

The majority of recent Bayesian development has been on the latent variable model 
and on the multinormal spatial error model, as in Equation (7.3). Thus, 


Y,=1if Y;*>0 
Y;=0if Y;* <0 


where 
Y;* = xB e 
ej =p 5 Wie; Uj 
J 
ui ~ N(O, 1) 


The vector of errors then has a multinormal form, 
e ~ MVNO, [I - pW) - pW") 


Lesage (2000) emphasises possible heteroscedasticity in binary data, so that 
ui ~ N(0, «;), where the k; ~ G(v, v) average 1 and v is the Student degrees of 
freedom. 


Example 7.1 Agricultural subsistence and road access The first worked example con- 
siders spatial dependence in the errors of a regression model for a continuous outcome, 
using the alternative ‘filtered’ form of the spatial errors model in Equation (7.6). Several 
studies have considered a dataset for thei = 1,.. 26 Irish counties relating the proportion 
Y; of the county’s agricultural output consumed by itself (i.e. subsistence) to a measure of 
its Arterial Road Accessibility (ARA); a normal approximation is generally adopted to 
this binomial outcome. The data is discussed and analysed in Cliff and Ord (1981). A 
linear model containing homoscedastic errors u; ~ N(0, o2) and with no allowance for 
spatial dependence 
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Yi= xip uj (7.13) 


(with xo; = 1) then serves as the baseline. 

As one model diagnostic (though not a model choice criterion), measures of spatial 
interaction such as Moran’s J may be monitored. Here the intention is to assess in the 
context of Bayesian sampling estimation how and whether such interaction measures 
are affected by different model spefications. Note that these are derived statistics, and 
there is no need to specify a prior on them; for example, in Equation (7.13) only priors 
on B and the variance (or in BUGS the precision) of the u; are required. In their original 
applications via classical (e.g. ML) procedures, estimates of such interaction measures 
may have been based on regression residuals and their estimated standard error 
obtained by assuming particular modes of data generation. The Bayesian sampling 
perspective means one is not constrained in this way, and posterior standard errors 
are obtained by repeated sampling. 

Here such an approach is exemplified using regression residuals as monitored over 
t=1,..,T iterations, for area i, denoted ue”. Then the sampling average of Morans 7 


statistic is obtained as 
1 = TOSI yl? WP 
tid iFj 

where in the present application, two definitions of (row standardised) interactions 
w; are considered. One is based on simple contiguities, the other on standardised 
weights based on intercounty distances dj and length of shared boundary, so that 
cij = By/dj, where By is the proportion of the boundary of county i in contact with 
county j, and then wy = cj/Xjcj. The latter interactions are supplied by Cliff and Ord 
(1981, p. 229). 

The Moran statistic typically has a small negative expectation, when applied to 
regression residuals (Cliff and Ord, 1981). One may be confident, however, that lower 
values of this statistic represent lesser autocorrelation, despite a caveat that the value of 
I = 0 does not correspond to the lowest possible autocorrelation (see Haggett et al., 
1977, p. 357). Note that one might also apply such measures to the unstructured errors 
for a discrete outcome, such as a disease count response where the model omitted spatial 
effects as in Equation (7.8b). 

With the baseline uncorrelated errors model, three chains are run for 15 000 iterations 
and posterior summaries are based on the last 14000 of these’. The monitoring includes 
Moran statistics for the regression residuals as in Table 7.1. These are similar to those 
cited by Cliff and Ord, for contiguity weights, namely 0.397 (s.e. = 0.12) and for 
standardised weights, namely 0.436 (s.e. 0.14). Although the linear regression has high 
predictive R?, there is under-estimation of subsistence in the remoter counties, and over- 
estimation of subsistence in the less isolated eastern counties, with better road and rail 
links. One option would be to include measures of such transport access, e.g. whether a 
county is served by a direct freight link to the Irish capital, Dublin. 

However, to make correct inferences about the regression estimate of subsistence on 
ARA, it is probably necessary to explicitly model the spatial dependence in the regression 
errors. Here contiguity weights are used in the spatial errors model of Equation (7.3) 


7 As elsewhere, outlier status is assessed by the CPO statistics obtained by the method of Gelfand and Dey 
(1994, Equation (26)); the product of these statistics (or the sum of their logged values) gives a marginal 
likelihood measure, leading to a pseudo-Bayes factor (Gelfand, 1996). The CPOs may be scaled as proportions 
of the maximum giving an impression of points with low probability of ‘belonging’ to the main data set. 
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Table 7.1 Models for subsistence rates 


Mean St. devn. 2.5% 97.5% 
Uncorrelated error model 
Moran(Distance-Boundary Weights) 0.45 0.12 0.21 0.69 
Moran(Contiguity) 0.35 0.14 0.09 0.63 
Bo (Intercept) —8.71 3.56 —15.62 —1.71 
Bı (ARA) 0.0053 0.0008 0.0038 0.0069 
Spatial errors model 
Moran(Distance-Boundary Weights) 0.064 0.11 —0.101 0.324 
Bo (Intercept) 0.475 0.688 —0.797 1.926 
Bı (ARA) 0.0021 0.0007 0.0007 0.0036 


p 0.914 0.08 0.703 0.997 


and with the regression means based on the transformed model in Equation (7.6). For 
improved identification, the intercept parameter" is represented as Bo = Bo — Bop. 

A run of 20000 iterations over three chains with 5000 burn-in ensures convergence in 
p, with its posterior density as in Figure 7.1. The median of 0.936 compares to a Bayes 
mode of 0.938 cited by Hepple (1995). The effect of ARA on subsistence is halved, and 
the Moran statistic based on the alternative spatial interaction definition (distance 
and common boundary) is clearly reduced in line with eliminating spatial dependence 
in the regression residuals (see Table 7.1). 

At individual area level, the CPO estimates do not show any clear outliers; the lowest 
scaled CPO is 0.09 for county Mayo (area 16) which has its subsistence rate under- 
predicted. The overall marginal likelihood based on these CPOs is —62.7, a clear 
reduction compared to that for the uncorrelated error model in Equation (7.13) (namely 
—72.8). 


Example 7.2 Columbus neighbourhood crime Anselin (1988) considers data on crime 
rates Y; (burglaries and vehicle thefts per 1000 households) in 49 neighbourhoods of 
Columbus, Ohio, with predictors income (X;) and housing values (X2) in thousands 
of dollars. He considers first an uncorrelated errors regression which gives, with stand- 
ard errors in brackets, 


Y =68.6 —1.60X%,; -0.27% 


(4.7) (0.33) (0.10 (7.14a) 


By contrast, two spatial models may be considered, namely a spatial autoregressive 
model’ and a spatial disturbances model. Here the latter is estimated, according to 


8 Sampled values of the true intercept By = 8y/(1 — p) will be essentially undefined when sample values of p 
are very nearly 1, and this will affect MCMC convergence. The true intercept may be estimated using posterior 
means of Bo and p. 

° Thus, 


Y =y PWY yX1 Y% u 


where u might have constant or area specific variance but is spatially uncorrelated. This is estimated by Anselin 
to have parameters p, = 0.43(0.12), yı = —1.03(0.31) and y, = —0.27(0.09). 
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DENSITY 


Figure 7.1 Plot of p 


Y=Bo BixX1 B2% e 
e=p,We u 


(7.14b) 


where u is initially taken as homoscedastic, but then a heteroscedastic alternative to 
account for outliers is considered. W is a row standardised weights matrix based on 
contiguities cj. Anselin employs maximum likelihood estimation of the spatial error 
model to give p; = 0.56 (s.e.=0.13). The coefficient of income is much reduced as 
compared to an uncorrelated error model (7.14a), namely to B; = —0.94 (s.e.=0.33), 
but that of housing value is enhanced, with B, = —0.30(0.09). 

Here the correlated error model is first estimated with u homoscedastic, and with a 
uniform prior over (0, 1) for pı. N(0, 10) priors are adopted for B; and B, to avoid 
numeric overflow. The model for the regression mean is re-expressed according to the 
transformation (7.6), namely 

Y=xp pW(Y—-xB) u (7.15) 
with u ~ N(0, o°). Early convergence in a run of 20000 iterations over three chains is 
apparent, with iterations from 1001 giving posterior estimates as in Table 7.2. 

These estimates are similar to those of Anselin (1988), and confirm the reduced 


impact of income (as compared to an uncorrelated errors model) when spatial correl- 
ation is modelled in conjunction with homogenous non-spatial errors u;. Examination 


Table 7.2 Crime rates, spatial error model, homoscedastic u 


Mean St. devn. 2.5% Median 97.5% 


A 31.3 13.0 9.3 29.7 60 
Bi —1.05 0.41 —1.87 —1.05 —0.25 
B2 —0.26 0.10 —0.47 —0.26 —0.06 
P1 0.55 0.20 0.13 0.56 0.90 


o? 124.4 27.6 82.1 120.3 190.2 
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of the CPO statistics suggests area 4 is suspect; this area is aberrant in terms of having 
the lowest crime rate but the sixth lowest income. (The two are usually inversely 
associated.) 

To allow for the possibility of non-constant variance over space (Lesage, 1999, 1997), 
and ensure more robust inferences about regression effects in the face of possible 
outliers, Student ¢ disturbances u; are adopted in the SAR model. This involves taking 
the precision of the ith observation 


K;T 
where the multiplier 
K; ~ G(v/2, v/2) 


This scale mixture of Normals is equivalent to a Student ¢ with v degrees of freedom. 
The degrees of freedom is also taken as a parameter. Outliers are indicated by low 
multipliers k;. 

In addition to a scale mixture, an alternative prior structure on p; is also proposed as 
in Lesage (1997). This may be seen as achieving robustness on the spatial interaction 
component of the model. Hence 


Tmp) x (s Dej (7.16) 


and for s > 1 values of pı near zero are downweighted. Lesage tries pre-determined 
values s = 0, 0.5, 1, 2 and 5, but here a discrete prior is adopted on these options, so 
that there is a joint prior structure 


72(p,|S)a1(s) 


The value s=0 is equivalent to a uniform prior over (0, 1). The prior (7.16) is 
implemented in BUGS via a discrete prior on the points p; = 0, 0.01, 0.02, ...0.99, 1, 
with original weights (s 1)p] scaled to sum to unity. 

For the spatial errors model, this leads to a higher value of p,, and v approximately 27 
— indicating some departure from normality in the crime rate outcome (see Table 7.3). 
The mean value of s is 1.76. This model makes the impact of income more negative, 
while the housing value coefficient now straddles zero. The choice between this model 
and the homoscedastic errors model (7.15) is based partly on its better accommodation 
of possible outliers, and also its better marginal likelihood (—189.7) compared to that of 
the homoscedastic model (—193.9). The resulting pseudo-Bayes factor clearly favours 
the more robust assumption on the u;. The Student ¢ weights x; still show observation 
4 as a possible outlier, with k4 = 0.48 compared to the remaining areas, where K; ranges 
between 0.70 and 1.11. 

A suggested exercise is to fit the spatial autoregression model (with a spatial lag in Y) 
under alternative prior assumptions (homoscedastic vs hetersoscedastic) on the errors u. 


Table 7.3 Spatial errors: non-constant variance 


Mean St. devn. 2.5% Median 97.5% 


Bo 19.09 11.78 2.00 17.22 47.98 
By —1.11 0.38 —1.86 —1.11 —0.35 
B2 —0.17 0.11 —0.40 —0.16 0.04 
v 27.8 27.9 2.6 14.2 93.9 


p 0.72 0.18 0.29 0.75 0.99 
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Example 7.3 Poisson data: the lip cancers example Spatial models for a discrete 
outcome are illustrated with the Scottish lips cancer data, included in the WINBUGS 
examples and widely analysed elsewhere. These are often small event counts Y; for 
i= 1, .. 56 counties, and for disease mapping purposes the goal of introducing random 
effects (unstructured and/or spatially dependent) is to obtain smoothed estimates of the 
Relative Risks (RRs) of cancer incidence by county. Inferences on regressors will be 
improved by taking account of possible overdispersion in the count outcome. 


A single regressor is provided in this example by 0.1 times the percent of labour force 
in agriculture and related occupations (x, = AFF/10), so the adjusted relative risks of 
lip cancer will also reflect county occupational structure as well as smoothing towards 
the local or global average. Interactions cj between counties are in terms of dummy 
indicators for contiguity (cj = 1 if counties 7 and j are neighbours). Model fit and 
regression inferences are assessed under different specifications for the prior on the 
errors e; in Equation (7.9). 

First, consider the CAR prior as in Equation (7.4), with a prior for p defined by the 
eigenvalues!’ of C. Then the model is 


Y; ~ Poi(E;0;) 
log (0) =x; ei=ßBo Bix ei 
ei~ NM, o°) 


Mi = p% cye; 
J 


The posterior estimates from a three chain run of 10000 iterations (with convergence 
after 500) shows a mean correlation of around 0.16, and 95% credible interval (0.12, 
0.17). The AFF variable has coefficient B averaging 0.46 and 95% credible interval 
(0.21, 0.71). The lowest scaled CPOs are for counties 50 and 55, namely 0.03 and 0.025 
(these are Dundee and Annandale). These results may be compared to those obtained 
by Bell and Broemeling (2000), who cite an interval (0.131, 0.170) on p; the paper by 
Stern and Cressie (1999) has a list of counties corresponding to the numbers used here. 

If instead we use the ICAR framework with default correlation of p= 1 as in 
Equation (7.11c), a lower estimate of the coefficient on AFF is obtained, with mean 
0.39 and 95% interval (0.16, 0.60). Here the model for the mean relative risk omits for 
the intercept for identifiability, and so is 


log (0;) = BıXı; e; 


This model has an improved pseudo-marginal likelihood estimate compared to the 
CAR model (—156.2 vs. —162.3) giving a pseudo-Bayes factor in its favour of around 
400. 

However, an alternative is a data driven choice of the ICAR correlation in the 
ICAR(p) model, with the model for the means now having an intercept again. This 
leads to an enhancement of the AFF effect compared to the ICAR(1) model, with 
posterior mean 0.43 and 95% interval (0.09, 0.68). The ICAR correlation is estimated to 
have mean 0.92 and median 0.94. This model has a worse pseudo-marginal likelihood 
than the ICAR(1) model, around —159.5. 


10 The minimum and maximum eigenvalues of C for the Scottish county data are —3.07 and 5.71, so the range 
for p is between —0.326 and 0.175. 
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Finally, a robust mixture approach drawing on the Lawson and Clark (2002) method, 
as in Section 7.3.2, is fitted. This is coded directly in Model D (Program 7.3) using the 
double exponential density ddexp() in WINBUGS, and to produce identifiability 
an additional autocorrelation parameter as in Equation (7.12) is introduced. A two 
chain run of 5000 iterations shows convergence at around 500 iterations and gives a 
higher coefficient on the AFF variable than the preceding methods, with mean (and 
95% interval) of 0.49 (0.22, 0.76). The autocorrelation parameters on the Normal 
and Laplace mixture elements have means 0.61 and 0.88. The lowest n; (most outlying 
case in terms of the Normal spatial structure) is for county 49, which has an unusually 
low crude and smoothed relative risk compared to the surrounding counties (which 
both abut it and are similar in terms of urban and occupational status). Despite extra 
insights like this, the pseudo-marginal likelihood is worse than for the ICAR(1) model, 
namely—161. 

Smoothed relative risk profiles are consistent across the four methods: the CAR 
option shows 18 out of 56 counties with RRs clearly above 1 (95% credible interval 
entirely above 1) and the ICAR methods both show 19 out of 56. 


Example 7.4 Glasgow deaths The conditional autoregression approach has been a 
popular way of tackling spatial dependence in discrete outcomes (e.g. in disease map- 
ping). However, other techniques have been proposed and may have benefits in terms 
of, say, allowing both multivariate outcomes and a mix of unstructured and structured 
errors as in the convolution model (7.9). As a case study of an alternative methodology, 
the analysis of this example follows Leyland et al. (2000) in considering a bivariate 
mortality outcome for a set of 143 Glasgow postcode sectors, namely cancer and 
circulatory deaths { Yi, Yi2}. Thus defining the spatial error in terms of an underlying 
unstructured error as in Equation (7.10), and with two outcomes in a convolution 
model, there are four errors to consider. Without predictors, we might then have 


Yı ~ Poi(En 01) 

Yn ~ Poi(Enðn 
log (01) = Boy un en (7.17a) 
log (92) = Boo un en (7.17b) 


In the first model (Model A), four errors are distributed independently of each other, 
and u; and up denote independently distributed and spatially unstructured errors with 
variances oł and o3. The spatial errors under Model A in Equations (7.17) are obtained 


as 
eil i Vil 


Jel; 


=o 
en = N; Vj2 


jeLi 
with the v being taken as independently Normal with variances t? and 73, 
va ~ N(O, 77) 
vz ~ N(O, 73) 
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A second option (Model B) takes the e; and en in Equation (7.16) to be separately 
CAR(1), as in Equation (7.11c). Here the “balanced dispersion prior’ recommended by 
Bernardinelli et al. (1995) is used. 

However, an alternative model allowing correlated errors (i.e. multivariate error 
dependence among all four errors) is straightforward. Denote the first two error terms 
are Ny = ui and Nj = un, and the third and fourth as n; = va and Ny = v2. The nj 
are then taken to be multivariate Normal with mean vector (0,0,0,0) and 4 x 4 disper- 
sion matrix Xy. A Wishart prior on X}' is assumed, with 4 degrees of freedom. The 
standardised version of Xy provides correlations R = [rj] between the four underlying 
errors. 

Here interest will centre on whether the multivariate error model provides a better 
description of the data, and how estimated relative risks on the two types of mortality 
are affected by the model choice. Model summaries are based on three chain runs of 
10000 iterations with 1000 burn-in. Under model C, this leads (Table 7.4) to a correl- 
ation of only 0.08 between the two unstructured errors, but a correlation of 0.28 
between the two spatial errors, v; and vj. 

As noted above, spatially dependent errors may proxy unobserved covariates (or 
covariates not included in a regression) which are themselves spatially associated; here 
such variables might include area deprivation, for example. Evidence for missing 
spatially correlated covariates will be strengthened by high correlations between the 
vj. However, Table 7.4 shows that the correlations are not pronounced and none are 
conclusively positive. 

It may be that the simpler, less parameterised, model is adequate. Using the CPO 
statistics for areas i and outcomes j based on monitoring the inverse Poisson likelihood 
gives marginal likelihoods for the independent error Models A and B of —904 and 
—913, compared to —908 for Model C. So Model A appears to provide the best 
explanation for the data. In terms of inferences about risk, Model C identifies more 
extreme relative risks: for instance, 19 out of 286 = 143 x 2 area-outcome relative risks 
exceed 1.25 under Model C, but only 4 under Models A and B. 


7.4 DIRECT MODELLING OF SPATIAL COVARIATION IN REGRESSION 
AND INTERPOLATION APPLICATIONS 


The preceding discussion and examples consider continuous and discrete outcomes for 
zones. As described in the Introduction, one may consider outcomes for point based 
data in broadly the same way. Instead of a fixed interaction matrix W, defined perhaps 
by contiguity between areas i and j, for point data any predefined adjacency is unclear, 


Table 7.4 Bivariate mortality outcome with correlated 
errors 


Correlation Mean St.devn. 2.50% 97.50% 


Ry 0.08 0.14 —0.19 0.35 
Ri 0.25 0.15 —0.06 0.53 
Ry 0.22 0.15 —0.09 0.51 
Rz 0.17 0.16 —0.15 0.46 
Rog 0.19 0.15 —0.10 0.47 


R34 0.28 0.19 —0.12 0.61 
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and the influence of proximity needs to be estimated. Consider observations y;i = 1,.. n 
observed at points s; = (Sii, S2j,..--- ), which are here taken to be in two dimensional 
space. A starting point for estimating the effect of proximity is provided by the matrix of 
interpoint distances, dj = |s; — sj]. 


7.4.1 Covariance modelling in regression 
For instance, suppose the goal is, as before, to estimate a regression 
Y=xß e (7.18a) 
where the joint dependence between errors might be described by a multivariate Normal 
prior 
e ~ N,(0, 5) (7.18b) 


such that the dispersion matrix È reflects the spatial interdependencies within the data. 
Outcomes may also be discrete, and then one might have, for binomial data, say, 


Y; ~ Bin(7;, Ni) t= len 


logit(a;) = xiB ej 


where again, the errors may be spatially dependent. Let the n x n covariance matrix for 
e be 


> = o° R(d) 


where o? defines the overall variance (as defined along the diagonal when i= j and 
di = 0) and R(d) = [rj(dj)] models the correlations between the errors e; and e; in terms 
of the distances between the points'’. The function r is defined to ensure that 
raldi) = ri(0) = 1 and that R is positive definite (Anselin, 2001; Fotheringham et al., 
2000). 

Among the most commonly used functions meeting these requirements are the expo- 
nential model 


rj = exp ( — 3d;/h) (7.19a) 


where / is the range, or inter-point distance at which spatial correlation ceases to be 
important’*. The Gaussian correlation function has 


rij = exp (— 3d; /Ir) (7.19b) 
and the spherical (Mardia and Marshall, 1984) has r = 0 for d; > h and 
ry = (1—3dy/2h d}/2h°) (7.19c) 


for dy < h; see Example 7.5 for an illustration. In each of these functions, h is analogous 
to the bandwidth of kernel smoothing models. If $} = o? R(d), then the covariance tends 


11 Although this approach is theoretically based on point data, it can be extended to aggregated data by 
considering the centroid of an area to define its location in continuous space. Ideally, such centroids take 
account of variations in population density within areas; for instance, centroids of electoral wards in the UK 
may be based on aggregating over smaller sub-areas of wards (known as enumeration districts), with the ward 
centroid being a population weighted average of sub-area centroids. 

12 This interpretation of A is clearer in the variogram version of the exponential model, considered below. 


DIRECT MODELLING OF SPATIAL COVARIATION IN REGRESSION 291 


to o° as dj tends to zero. In some cases, there will further baseline variability (e.g. due to 
measurement error) and the covariance may be defined as 


LY=r oR) 


with the limiting variance as dj tends to zero being 7? o° instead of o?. Another 


formulation in this case involves a discontinuity when d; = 0, so that 


Ly = ryd) iF j 
Ši = T 


Whatever the model adopted for the correlation between errors, the problem reduces to 
simultaneously estimating the regression coefficients and the parameters of the distance 
decay function. 


7.4.2 Spatial interpolation 


In other situations, the emphasis may be on interpolation or prediction at locations 
s.new, on the basis of the observations Y;, i = 1,.., n made at points (taken to be in two 
dimensions) s; = (51;, 52;). An example of spatial interpolation or ‘kriging’ from a 
Bayesian perspective is provided by Handcock and Stein (1993), who consider the 
prediction of topological elevations Yney at unobserved locations on a hillside 
given an observed sample of 52 elevations at two dimensional grid locations. To 
emphasize the definition in space, denote the observations at s; as Y; = Y(s;) and 
assume that the model for Y(s;) consists only of a mean j(s;) which varies over space, 
where space here means the two dimensional real space. The most general process is 
then 


Y(s) = ws) e(s) 


where the error vector e(s) has mean zero and covariance o7(s)R, and the mean 
w(s) = EL Y(s)] models the first order trend over space. Trend, sometimes known as 
large scale variation, is exemplified by the North West to South East gradient in 
cardiovascular disease in England (Richardson, 1992), while small scale (second 
order) variation represents features such as spatial correlation in neighbouring areas. 

Just as for time series models, prediction to new locations may be facilitated by 
transformation of the data to remove trend and induce stationarity, meaning the 
mean and variance are independent of location, and in particular that cov(e) = o7R. 
Methods for spatial interpolation may, for instance, be based on analysing 
transformed observations or residuals e; which are free of trend. An example of such 
data transformation is the EDA method known as row and median polish, and illus- 
trated by Cressie (1993) on percents coal ash. One may also use regression to account 
for spatial trend: this is known as trend surface analysis, where the above model takes 
the form 


Y(s)= u(s) e(s) = x(5)B_—e(s) 


and the x(s) are functions of the spatial coordinates (s1;, 52;). One might have a trend 
regression model with x; = sı and x2 = sp (linear terms in the grid references), x3 and x4 
being squared terms in sı and s2, respectively, and xs being an interaction between sı 
and s2. Simultaneously modelling spatial effects in the mean and variance structure 
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may, however, be problematic (Ecker and Gelfand, 1997), and often a constant mean is 
assumed (this is known as ordinary kriging), with 
(s) = u = Bo 


After trend or location is allowed for in the model for p(s), the covariation in the errors 
may be modelled. Direct modelling of the covariance structure of the observations Y 
may also be of interest provided stationarity is plausible. 


7.4.3 Variogram methods 


A technique often used to estimate the covariance structure focuses on functions of 
dissimilarity between errors or observations. Let È(d ) denote the covariance matrix at 
distance d between points. Whereas 

X(d) = 0’ R(d) 
diminishes to zero for widely separated points and attains its maximum as dj; tends to 
zero, the variogram function 

2 2 

yd) = o% — 3(d) = o (I — R(d@)) 


has value zero when d; = 0, and reaches its maximum at øo? as spatial covariation in 
(d) disappears; hence g? is known as the sill in geostatistics. For instance, the vario- 
gram for the exponential model is 


y(dy) = 021 — e/h) (7.20) 
As above, an extra variance parameter to allow for measurement error at d = 0 (the 
‘nugget’ error) may be added to give 
yd)=v oM- Rad) 
and the sill is now v? ø?. An alternative version of a nugget error model is 
yd=v (0° -ôU — Rd) 


In fact, the expected value of the difference!? between observations Y; and Y; (or errors 
ei and e;) at separation dj is 2y(dj). Estimation of the variogram is often carried out 
within narrow distance bands, such that for the n(d) points within a band {d—s,d_ €} 
the estimated variogram is 
A 1 2 
W()=057—- Yi Œ-Y) 
n(d) dye{d—e, d £} 


13 For points y(s) and y(t) at separation d = s — t, the variogram is 
¥(s, À = 0.5ED(s) — yD? 
= 0.5Var[y(s) — y(ð] 
For an isotropic process (where the covariation does not depend upon the direction between s and ĉ), 
ys, N =s At A) 
and taking A = —t gives 
Xs, 0) = ys — 4,0) = yd) 
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So estimation of the variogram typically considering distances between all n(n — 1)/2 
pairs {i, j} of points and then grouping them into a relatively few (say 10 or 20) distance 
bands. Then the relation between (d) and d is modelled over distance bands 
d =1,..,D via nonlinear least squares in terms of parametric forms such as Equation 
(7.20). If n(n — 1)/2 is not unduly large, one might fit a nonlinear curve such as 
Equation (7.20) to all sets of paired differences (Y; — Yj)" or (e; — ey. 

However, recent Bayesian approaches have tended to focus on spatial interpolation 
consequent on direct estimation of the covariance matrix from the likelihood for 
Y; = Y(s;). Thus, Diggle et al. (1998) consider adaptations for discrete outcomes of 
the canononical model 


Y;=pu e(s;) ui (7.21) 


where the errors u; are independently N(0, 7°), and accounts for the nugget variance, 
and the vector e(s) ~ N,,(0, o? R) models the spatial structure in the errors. This can be 
seen as a continuous version of the convolution model of Equation (7.9). Ecker and 
Gelfand (1997) also adopt this likelihood approach, but consider generalisations of the 
usual parametric covariance structures such as those of Equations (7.19a)-(7.19c). 

Whatever the approach used to estimate (d), prediction of Y,., at a new point 
s.new then involves the n x 1 vector g of covariances g; = Cov(Snew, 5;) between the new 
point and the sample sites s1, s2... Sn. For instance, if (d) = o*e~*“/", then the covar- 
iance vector is obtained by plugging in to this parametric form the distances 
dinew = |Snew — 51|, Gonew = |Snew — S2|, etc. The prediction Ynew is a weighted combin- 
ation of the existing points with weights \;, i = 1,.. n determined by 


h=g>d! 


For example, the prediction Ypew under Equation (7.21) is obtained (Diggle et al., 1998, 
p. 303) as 


Ynew = H gI RY = u) 


7.4.4 Conditional specification of spatial error 


While joint prior specifications are the norm in such applications, one might also 
consider conditional specifications with the same goals (e.g. estimation of proximity 
effects and interpolation) in mind. Defining weights a, in terms of the parametric forms 
in Equation (7.19), and then defining w; = ajj/ $ jz; aij one might specify 


Y; a N(y,, 7) 
HSM ĉi 


where e; follows a conditional prior 


Epis N (= Wi), ob? / 5 w) 
J#i J#i 
This is the ICAR(1) model with non-binary weights, as in Equation (7.11b). For 
instance under the exponential model, aj = exp ( — kdj), where k > 0. As above, iden- 
tifiability may be established by introducing a correlation parameter p, which might, for 
instance, be assigned a prior limited to the range [0, 1], so that 
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ei~ n(o Wie}, e/a) (7.22) 

J#i j+i 
Another option under an ICAR(1) prior is to centre the sampled e; at each iteration in 
an MCMC run. For interpolation of Y at a new location Snew under this prior, one 
would evaluate the function aj in terms of distances between the observed locations 
S1, S2, ...8, and the new location spew. This gives a vector of weights anew = (anew, i) and 
after scaling to give Wnew, the estimate of Cnew = Ynew — p at the new location is formed 
as di@iWnew, js 


Example 7.5 Nickel concentrations The principles involved in deriving a variogram 
are illustrated using a set of geochemical readings (nickel in ppm) at 916 stream 
locations in the northern part of Vancouver Island in Western Canada. The logged 
nickel readings are taken to lessen skewness (cf. Ecker and Gelfand, 1997). The 
(916)x(915)/2 pairs of readings are accumulated into distance bands at a ‘lag’ 0.1 km 
apart. Many pairs exceed 5km in separation, and the estimation here only includes 55 
bands at average distances 0.05, 0.15, 0.25, ... up to 5.45 km. 


The observations at separation d consist of averages Gy of the differences squared 


Zij = SS (Gi - XY 
dije(d—0.05, d 0.05) 


for the ng pairs of points separated by distances between d — 0.05 and d 0.05km. As 
well as the average of Z in each band, the variance V4 of the differences (Y; — yy in 
each band is recorded. The means are then taken as normal with known precision 
na/Va, with their regression means following a spherical model 


Ga ~ N(y@), Va/na) 
y(d)=v? (o° — v’)3d/(2h) — ÈP] if d<h 
y(d) = 0? ifd>h 


A uniform prior on the ratio v?/(o” v?) is adopted, with the range assigned a uniform 
prior with minimum 0.05 and maximum 5. 

A two chain run of 5000 iterations leads to a nugget variance v? averaging about 0.31, 
a sill variance øo? of 0.87 and a range averaging 1.61. The range is lower than the value of 
2.13 reported by Bailey and Gattrell (1995), though these authors took wider bands (a 
0.2 km lag separation) to derive the variogram. 

A second model is provided by the Gaussian form with 


y(d)=v (0° — vl — exp(— 3d*/r°)] 


This provides a lower pseudo-marginal likelihood than the spherical model, where the 
relevant CPO statistics are estimated using the average inverse likelihoods. The nugget 
variance increases to around 0.4, while the range is estimated as 1.36 with 95% credible 
interval (1.23, 1.50). 


Example 7.6 Likelihood model for spatial covariation To illustrate the direct likeli- 
hood approach to estimating spatial covariation as a function of distance, 20 points s; 
are randomly generated within the square region defined by extreme SW point (0, 0) and 
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extreme NE point (100, 100). Distances between points are obtained as dj = |s; — sj|. 
The Y; are then sampled from a multinormal density with mean 160 and covariance 


S(d) = Rd) 


where R(d) = exp ( — 3d /h), the range h = 100, the measurement error variance 20, and 
the scale variance o? = 100. The points and their associated Y values are shown in 
Figure 7.2. 

To reflect the WINBUGS centred Normal structure, the likelihood for the data so 
generated is then univariate Normal as in Equation (7.21), namely 


Yi~N(w erT) 


with var(u;) = 7°. In the first joint prior approach for the spatially structured errors e;, it 
is assumed that 


e ~ N,(0, o° R) 


where r; = exp ( — 3d; /h). The maximum range for the region concerned is 141.4, and a 
discrete prior with 10 values from 14 to 140 is adopted. For the two variances a prior 
ideally expresses their interdependence, and here a prior on the sum ọ = 1/17? 1/0? is 
adopted with 1/t? = Ug, where U ~ B (1, 1). Since the model includes only a mean, a 
residual precision exceeding 1 (variance under 1) is unlikely for the sampled data, and 
two priors may be considered: a G(0.1, 1) prior with mean 0.1 and a G(0.1, 0.1) prior 
with mean 1. With the first prior, a two chain run of 2500 iterations (with convergence 
after 500) is taken. This yields a reasonable estimate of the range and the variance 
parameters and a prediction at (10, 10) of 172.4. It is left as an exercise to experiment 
with different priors for » such as the G(0.1, 0.1), and assess how predictions to (10, 10) 
or other points are affected. 

For the conditional prior approach, the same method as for the joint prior model is 
used to define the inverse variances 1/” and 1/7? (see Section 7.4.4). A N(0, 1) prior, 
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constrained to positive values, is adopted for k. With b ~ G(0.1, 1) a two chain run of 
2500 iterations (with convergence at under 500 iterations) yields an estimate for 1” close 
to the generating value (posterior median 20.3), with the median of b? estimated at 11.4 
(Table 7.5). The parameter k has a skew posterior with mean 0.12. The prediction at (10, 
10) is higher than under the joint prior, namely 175. 

The joint prior has a better fit than the conditional prior model according to the 
predictive loss method of Gelfand and Ghosh (1998), but its prediction seems to tend to 
‘oversmooth’, since the two closest points to (10, 10) have values 172 and 182. The 
prediction of the conditional prior is more in line with the neighbouring observed 
values. 


Example 7.7 Attempted suicide in East London Spatial covariance modelling for 
discrete outcomes is illustrated by data for attempted suicide (‘parasuicide’) in 66 
electoral wards in East London (the East London and City Health Authority). The 
records of this health event are based on hsopital admissions following deliberate self- 
harm. Following work by Congdon (1996) and Hawton et al. (2001), one may relate 
small area variations in parasuicide to both area deprivation and the strength of 
community ties or their absence (‘social fragmentation’). Both factors are represented 
by composite scores. The analysis involves actual and expected attempted suicides (Y; 
and £;) over the period 1990-92, for all ages and males and females combined, with ward 
locations defined by their centroids as supplied by the Office of Population Censuses 
and Surveys in the 1991 UK Census ward profile (and derived by population weighted 
averaging over the centroids of micro areas known as enumeration districts, nested 
within wards). 


A simple Poisson regression (without spatial or unstructured errors) shows both 
factors apparently influencing variations in the outcome, with the coefficients on 
deprivation and fragmentation having posterior averages (with 95% credible intervals) 
of 0.047 (0.031, 0.062) and 0.049 (0.028, 0.069). 


Table 7.5 Models for simulated data and prediction to new point 


Mean St. devn. 2.5% Median 97.5% 


Joint Prior 


u 171.3 5.8 159.0 171.3 183.2 
o? 80.6 42.6 25.3 71.9 183.4 
T? 19.0 15.8 2.1 15.1 58.8 
h 102.1 29.7 42.0 112.0 140.0 
Y.new, at (10, 10) 172.4 6.2 158.9 172.6 184.6 


Conditional Prior 


u 169.7 3.1 163.2 170.0 175.1 
K 0.12 0.05 0.02 0.12 0.23 
b? 20.3 32.9 1.1 11.4 101.6 
T? 32.1 33.6 1.0 20.3 119.9 
p 0.81 0.20 0.18 0.88 0.99 


Y.new, at (10, 10) 175.0 3.2 168.0 175.6 179.9 


DIRECT MODELLING OF SPATIAL COVARIATION IN REGRESSION 297 


However, there may be unobserved and spatially varying risk factors which produce 
spatial dependence in the regression errors. The second coding in Program 7.7 applies 
the direct estimation approach using the intrinsic conditional'* prior in Equation (7.22), 
and weights aj as in Equation (7.19a), to investigate the spatial structure of the errors. 
For identifiability, a spatial correlation parameter p is included so that 


Y; ~ Poi(£;6;) 


0; = Bx; e; 
ei~ n(o 5 Wij, o?/ 5 «| 
j#i j#i 


ay = exp ( aa Kdjj) 


wy = ay] 9 ay 
J#i 

A run of 2500 iterations (with convergence from 500) shows the distance decay param- 
eter k to average just under 2, and the error autocorrelation to average 0.74. The mean 
effect of fragmentation (though not deprivation) is reduced, and the credible intervals 
for both factors are widened. In particular, the fragmentation effect is no longer clearly 
positive (Table 7.6). 

The third model assumes a fixed interaction matrix, defined as c; = 1 if ward j is 
among the five nearest neighbours of ward i (in terms of Cartesian distance), and cj = 0 
otherwise. Then the ICAR(p) prior for spatial errors e; is used. This model shows an 


enhanced mean deprivation effect, while the fragmentation effect is essentially elimin- 
ated (Table 7.6). 


Table 7.6 Models for parasuicide 


Distance decay model Mean St. devn. 2.5% 97.5% 


Bı (Intercept) —0.13 0.11 —0.37 0.07 
Bə (Deprivation) 0.048 0.021 0.009 0.089 
B; (Fragmentation) 0.033 0.029 —0.026 0.089 
K 1.84 0.54 0.87 3.00 
p 0.74 0.22 0.18 0.99 
ICAR (p) 

8, (Intercept) —0.16 0.11 —0.37 0.07 
Bə (Deprivation) 0.056 0.020 0.016 0.096 
B; (Fragmentation) 0.017 0.028 —0.036 0.074 


p 0.75 0.18 0.32 0.99 


14 One might also consider a joint prior as in Equation (7.18b). However, computation in joint prior models 
involving a structured covariance matrix > is slowed considerably by the necessity to invert the covariance 
matrix during MCMC iterations. This can be avoided by discretising the prior on the parameters (apart from 
o?) in (d). Thus, for a K point grid on A in Equation (7.19a), one might pre-calculate K matrix inverses R}, 
where rj = exp ( — 3dj/h). Obtaining a suitable grid (one that encompasses the density of the parameter) may 
require some preliminary exploration of the problem. The set of matrix inverses is then input as extra data to a 
model where h is assigned a discrete prior with K bins. One may also sample directly from the relevant full 
conditional densities, as in Diggle et al. (1998), and certain options for spatial covariance matrices are now 
implemented in the GEOBUGS module of WINBUGS14. 
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In terms of the Gelfand and Ghosh (1998) criterion of footnote 5, the ICAR(p) model 
has a slight edge in fit, namely 5180 as against 5250 for the direct estimation approach 
(the same criterion for the simple Poisson model is 11 200). However, much can be done 
to improve the performance of the direct estimation method by varying the modelling of 
distance decay, for example using weights aj such as those in Equations (7.19b) or 
(7.19c) or taking a power rather than exponential decay system, with aj = d; K(k > 0). 


7.5 SPATIAL HETEROGENEITY: SPATIAL EXPANSION, 
GEOGRAPHICALLY WEIGHTED REGRESSION, 
AND MULTIVARIATE ERRORS 


With geographical data, the preceding sections have shown how regression errors may 
well be spatially correlated, and have sought to model spatial dependencies in errors. 
However, an alternative or at least complementary perspective on spatial outcomes is in 
terms of spatial heterogeneity, meaning either heteroscedastic variances, spatially vary- 
ing regression effects, or both. In the preceding worked examples, one might well allow 
the unstructured error such as u; in Equation (7.9) to have a variance that differs 
between individual areas, or perhaps be governed by a spatial regime, involving k < n 
subsets of areas, within which the areas have the same variance (Anselin, 2001). For 
instance, Example 7.2 exemplifies how the assumption of constant variance of the error 
terms may well be questionable, but the same approach could be applied in the other 
examples, and below the data in Example 7.3 is reanalysed from the heterogeneity 
viewpoint. 

While heteroscedasticity may well be relevant, an alternative perspective on spatial 
heterogeneity focuses on the regression parameters themselves: linear and general linear 
spatial models typically assume that the structure of the model remains homogenous 
over the study region without any local variations in the parameter estimates. It is likely, 
in fact, in many applications that regression effects are not constant over the region of 
application (Casetti, 1992). In fact, varying regression effects over space are analogous 
to the more familiar state space models in time series applications which involve time 
varying regression coefficients. 

Spatial heterogeneity in the impacts of regression predictors has been the subject of 
two major recent advances, namely the spatial expansion model Casetti (1972) and 
geographically weighted regression (LeSage, 2000). These have focused on the normal 
linear model (with modifications for heavy tailed errors via Student ¢ extensions), 
though extension to general linear models is quite feasible. Congdon (1997) describes 
an ICAR(1) model for spatially varying regression coefficients with a univariate Poisson 
outcome, and below a variation on the prior of Leyland et al. (2000) is considered to the 
same end. 


7.5.1 Spatial expansion model 


The spatial expansion model of Casetti (1972, 1997) assumes that the impacts of one or 
more of p regressors, xj; on a continuous outcome Y; vary according to the locations 
(Sii, S2;) of the areas or points. These locations may be taken with reference to a central 
point (0, 0), so that one or both elements in the grid reference may be negative. One may 
allow for fixed (i.e. non-spatially varying) impacts c1, ¢2,.. Cp of p regressors, as well as 
spatially varying ones b;(j = 1,.., p), though the constant co does not have a parallel 
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spatially varying effect. (The latter choice is made by Lesage, 2001, on identifiability 
grounds.) Then for a metric univariate outcome Y over areas i one might have 


Y; ~ N(w;, p) (7.23a) 


W; = Co CiXii C2X2i ---  CpXpi biiXii baiX2i -.-  bpiXpi (7.23b) 
with the spatially varying effect (or ‘base effect’) modelled as 
bi = SV Sayy j=l,- p (7.23c) 


There are 3p 1 parameters to estimate: the constant co, p fixed regression effects cj, 
and the 2p base effect parameters, y,; and y3. The combined non-spatial and spatial 
impact of x; then varies by area i, and is 


Bi =o by (7.23d) 
with average over all areas 


This model may therefore capture variations in the regression relationships, especially if 
clusters of adjacent observations have similar regression behaviours, or if economic 
relationships vary according to distance from city centres. 

One way of extending these models to allow for non-constant variance is to take area 
specific variance parameters b; as functions of the locations (and maybe other factors), 
with 


p; =exp(Yo Ysi Y252i) 


7.5.2 Geographically weighted regression 


The method of Geographically Weighted Regression (GWR) also makes regressions 
specific to the location of point or area i (Brunsdon et al., 1996, 1998). Somewhat like 
cross-validation with single case omission, GWR essentially consists in re-using the data 
n times, such that the ith regression regards the ith point as the origin. The coefficients 
bii, .. bpi for the ith regression entail distance based weightings of the ith and remaining 
cases (with 7 as the centre). These weightings define the precision parameter for each 
area in a Normal or Student ¢ likelihood. 

As in Equations (7.19a)-(7.19c), the weighting might be exponential, Gaussian, or 
spherical, with the Gaussian form being 


ay. = exp(—d;/2h’) h>0 (7.24) 


This function means that for small distances aj, is close to 1, with the decay effect 
increasing as the bandwidth parameter / tends to zero. In all these functions, a prior 
may be set on A, taking account of the maximum observed inter-point or inter-area 
distance. Other options are profile type analyses to maximise a fit criterion. Alterna- 
tively, one may combine GWR with formal cross-validation. This is based on omitting 
the ith observation from the ith geographically weighted regression, and then predicting 
it using the remaining n — 1 cases. 

Suppose the outcome Y+ is univariate and metric, with sample indices k = 1, ..,n and 
TEgressors X 1K, X2k, - - - Xpk for area k. Conditioning on a choice of spatial weight function, 
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for the ith regression, one might use the notation yi = Yp (providing n copies of the 
data), and then take 


Vie ~ Nis Te) k=1,..n (7.25a) 
Tik = 0; / Aix (7.25b) 
Bic = boi buxie baixas ~.. bpiXpk (7.25c) 


where ¢; is an overall variance, homogenous across areas in the ith regression. Since the 
aip decline with distance, nearer observations have lower variances (higher precisions). 
This implies a greater weighting in the ith regression such that observations near the 
area or point i have more influence on the parameter estimate b; = (boi, bii, .... bpi) than 
observations further away. 

As in the spatial expansion method, one might robustify against outlying areas (in 
terms of the estimated relationships between yip and xx) by taking a scale mixture 
(heavy tailed) approach. This allows non-constant variances with scaling factors kx 
drawn from a gamma mixture G(0.5v, 0.5v), where v is the degrees of freedom of the 
Student density. Thus 


Yik ~ Nw, Te) k=1,..n (7.26a) 
Tik = Qik /äik (7.26b) 
Pi = Q;/Kik (7.260) 


Lesage (2001) considers several spatial applications in terms of a choice between v = 30 
or 40 (essentially a fixed variance over all areas in the ith regression so that b = ¢,) 
and v varying between 2 and 9, leading to non-constant variances ,,. Under the latter, 
one may set a higher stage prior on v itself. For example, taking 


v ~ G(8, 2) 


would be an informative prior consistent with expected spatial heteroscedasticity, and 
allow one to discriminate between non-constant variances over space as against non- 
constant regression relationships. On the other hand, a model with the full GWR 
parameterisation {namely the set of parameters h; v; bj, j =0,..p; i= 1,..n; and 
bz, i= 1,.. n; k= 1,.. n) may become subject to relatively weak identifiability. 

Although not considered in the ‘classical’ GWR literature there seems nothing against 
random effects models for the b;; (the coefficient on variable j in the ith regression) as a 
way to pool strength and improve identifiability. Either unstructured effects could be 
used, with the 5; referred to an overall average p.b; or spatially structured as considered 
by Lesage (2001). 


7.5.3 Varying regressions effects via multivariate priors 


A third approach to spatially varying regression coefficients, with particular relevance 
to discrete outcomes, involves extending the prior of Leyland et al. (2000) and Langford 
et al. (1999). Note that this is a single regression model, not involving n fold repetition 
like the GWR method. Thus, consider a model for a count outcome Y; ~ Poi(£;6;) such 
that the coefficients on two predictors x; and z; were spatially varying. The convolution 
model is then 


log(@®) =a u; e Bry Yi (7.27) 
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where underlying the spatially structured effects e;, 8,, and y, are three unstructured 
errors v;i, vi, viz Which are linked to the effects in Equation (7.27) via scaled weighting 
systems which are here taken to be the same across the three effects: 


n 
ej = y WijVji 
j=l 


If also u; = v;4 then one may set a multivariate prior of dimension 4 (e.g. MVN or MVt) 
on the vy, with mean zero for v; and v;4 but means Bu and Yp ON v2 and v;3, where B 
and y, are themselves assigned priors. In the case of Equation (7.27), let 
By = (0, By, Yu» 9) be the mean vector and X, be the dispersion matrix of dimension 
4x 4. In general, for the convolution model as above but with p covariates with 
spatially varying effects, ©, will be of dimension (2 p). 

This model is relatively simple to extend to multivariate dependence — as compared to 
adapting, say, the ICAR model — if allowing for interdependence of the vj improves 
model fit. One might also envisage modifying the joint prior (7.18b) to allow several 
spatial effects (such as e;, B; and y; in Equation (27)) to both interact and have a 
different distance decay. 

Coefficients are likely to vary when the regression association shows inconsistent 
strength over the region of interest. Consider the coefficients on x;, and suppose pg 
is clearly positive. If a set of neighbouring areas have high regression coefficients 
(e.g. B; above the mean pg), this means that in this subregion high values of x; are 
consistently associated with high relative risks, and low values of x; with low relative 
risks. By contrast, a set of B; around zero in a particular subregion will reflect 
an inconsistent pattern: high values of x; are sometimes associated with low relative 
risk, sometimes with high relative risk. This perspective may provide additional 
insight into the sometimes cited rationale for the spatial term e; in the fixed coefficient 
model 


log (vi) =Q Bx Ui ĉi 


namely that e; proxies risk factors that are unobserved but smoothly varying in space. In 
fact, the varying coefficient model may clarify where in the region such unobserved risk 
factors are more relevant. 


Example 7.8 Columbus crime: spatial expansion and GWR Here the spatial expansion 
and GWR methods are considered in the context of the Columbus crime data set, as 
analysed in Example 7.2. For the expansion method, one might take the geographically 
varying coefficients bj in Equation (7.23c) to be functions of the distance d; of {s1;, s2;} 
from the Columbus centre. The centre may be taken as area 32 in terms of the order of 
areas set out in Anselin (1988). This gives a base effect with just p parameters (rather 
than the 2p parameters in Equation (23c)), namely 


bj = diy; PH lyeep 
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However, suppose the original grid references, namely eastings-northings {s1, s2}, are 
retained. This option involves 2p parameters in the base effect model, and coefficients 
for area i as 


bii = SUV S2iV2j J = 1, p 


Note that sı is lowest in the West and sz is lowest in the South. 

For illustration, the total effects on crime rates of income (xı) and house values (x2) 
for the first five areas are monitored over the MCMC run. These total effects are related 
both to the base effects operating via the locations {51;, 52;}, as well as the overall non- 
spatial effects, co (constant), cı (income), c2 (house values) after allowing for the spatial 
expansion. Note that a linear model with homogenous coefficients (and spatially 
uncorrelated errors) gives parameters and f-ratios as follows: 


co = 68.6(14.5), cı = —1.6(4.8), and cy = —0.27(2.7) 


Relatively informative N(0, 1) priors are assumed for y,; and y}. A full spatial 
expansion model as in Equation (7.23b) shows no non-spatial effect c2 of house values, 
though a negative non-spatial effect of income (cı) remains, and is more negative than 
under the homogenous coefficients model. Table 7.6 shows the combined effects, as in 
Equation (7.23d), for areas 1 to 5. The base effects may be interpreted as deviations 
from the overall effect as one moves from west to east or from south to north. Thus, the 
effect of income becomes less negative as one moves east, but more negative as one 
moves north. The effects of income in areas 1-5 are less negative than the average of 
—1.95 given by B4 (see Equation (7.23e)), while those of house values are more negative 
than the average. 

For the GWR analysis, a Gaussian distance decay function is taken, with 


dik = b(dix/ [sah]) 


where ¢ is the standard normal density, sy is the standard deviation of all the inter- 
neighbourhood distances and h = 1 is taken as known (as indicated by a preliminary 
search over values between 1 and 10). As is usual in a GWR analysis, varying regression 
intercepts between the n regressions are allowed, as well as varying predictor effects over 
space (see Equation (7.25c)). The Bayesian sampling approach to estimation means one 
can monitor statistics such as the minimum and maximum b;; for predictor j over all 
i=1,..,m regressions, and also the probability that the ith regression provides the 
maximum or minimum coefficient. 

From a single chain of 20000 iterations (500 burn in), Table 7.7 shows that the 
average income and house value effects, denoted bavg[] in Program 7.8, are similar to 
those obtained from the homogenous coefficient model and cited above. 


Example 7.9 Lip cancers; varying regression effects and heteroscedasticity This 
example considers the Scottish lip cancer data of Example 7.3, but using the varying 
coefficient model of Equation (7.27) as one alternative explanation of the observations. 
The model adopted for varying coefficients is 


log (0;) =a ej Bixi 


where x; is 0.1 times the percent of AFF workers. The other ‘spatial heterogeneity’ 
approach involves non-constant variances in the model 
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Table 7.6 Spatial expansion model (three chain run of 10 000 iterations, 
500 burn in) 


Total effects for areas 1—5 


Mean St. devn. 2.5% Median 97.5% 


Income 
1 —0.70 0.56 —1.74 —0.71 0.44 
2 —0.93 0.50 —1.88 —0.95 0.09 
3 —1.13 0.46 —2.00 —1.14 —0.21 
4 —0.98 0.47 —1.88 —0.99 —0.03 
5 —0.73 0.60 —1.83 —0.75 0.52 
House values 
1 —0.66 0.15 —0.97 —0.66 —0.37 
2 —0.56 0.13 —0.82 —0.56 —0.30 
3 —0.48 0.12 —0.71 —0.48 —0.25 
4 —0.56 0.13 —0.81 —0.56 —0.31 
5 —0.63 0.16 —0.95 —0.62 —0.32 


Base effects income (West to East then South to North effects) 


Vu 0.105 0.054 0.004 0.106 0.211 
Y21 —0.047 0.034 —0.112 —0.048 0.018 


Base effects house values (West to East then South to North effects) 


yi2 —0.042 0.021 —0.083  —0.043 —0.001 
Y22 0.027 0.013 0.001 0.028 0.052 


Non-spatial effects 


co 69.6 4.8 60.1 69.6 79.1 
c —3.45 1.77 —6.87 —3.29 —0.27 
rp) 0.16 0.66 —1.05 0.11 1.39 
Total effects 

Bi —1.95 0.38 —2.68 —1.94 —1.18 
B2 —0.12 0.12 —0.34 —0.12 0.12 


log (0;) =Q üi Bx; 


such that u; ~ N(0, 1/,;) and 6; = bk;, where k; ~ G(0.5v, 0.5v). 

A two chain run of 50000 iterations for the varying coefficient models (with conver- 
gence by 5000 iterations) shows a DIC of around 85.7 which is very close to that 
(namely 85.6) of the ICAR(p) model of Example 7.3. The coefficients on the AFF 
variable are highest in a belt of counties in the Highlands and on the Moray Firth, where 
variations in lip cancer mortality seem to follow the percents AFF quite closely: these 
are Nairn (county 13), Skye-Lochalsh (1), Inverness (19), Banff (2) and Ross and 
Cromarty (5). The lowest coefficients are in a set of counties in east central Scotland 
Gust north or east of Edinburgh), namely Clackmannon (43), Falkirk (42), Kirkcaldy 
(25), Dunfermline (26) and two counties in the far south west of Scotland, namely 
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Table 7.7 GWR analysis: varying effects of income and house value 


Mean St. devn. 2.5% Median 97.5% 


Average Intercept 68.9 0.7 67.5 68.9 70.2 
Income —1.60 0.05 -1.71 —1.60 —1.50 
House Value —0.24 0.02 —0.28 —0.24 —0.21 

Maximum Intercept 83.2 3.2 78.2 82.8 90.2 
Income —0.18 0.26 —0.60 —0.20 0.40 
House Value 0.19 0.09 0.04 0.17 0.40 

Minimum Intercept 49.1 4.6 39.3 49.4 56.8 
Income —2.78 0.24 —3.34 —2.75 —2.39 
House Value —0.72 0.07 —0.86 —0.71 —0.60 


Annandale (55) and Stewartry (32). In these two subregions lip cancer mortality is low or 
average, but the percent AFF varies widely, and Clackmannon has below average 
mortality but a percent AFF of 16. The average coefficient on the AFF variable 
is 0.51 with 95% interval (0.20, 0.87), higher than under the conditional autoregressive 
priors. 

The heteroscedastic approach employs a discrete prior with 28 bins on the value of 
v. There are greater prior concentrations of the grid at lower values of v. The posterior 
median for v is 18, suggesting heavier tails than in the Normal. Other aspects of 
this model are that it also enhances the AFF coefficient, with a two chain run of 
10000 iterations (1000 burn in) showing a posterior mean and 95% interval of 0.68 
(0.40, 0.96). There is a loss of fit (compared to the models in Example 7.3) with the DIC 
standing at 93, but various model developments such as adding a spatially correlated 
error may improve on this. The lowest k; are on Skye-Lochalsh, Annandale and 
Falkirk. 


7.6 CLUSTERING IN RELATION TO KNOWN CENTRES 


The above modelling strategies have considered whether measures for aggregate 
areas or individual outcomes exhibit a spatial structure without focusing on known 
cluster centres or seeking to identify unknown cluster centres. In environmental epi- 
demiology, it is frequently the goal to assess the degree of focused clustering in 
relation to one or more hazard sites. For instance, the existence or otherwise of 
clustering of child leukaemia cases near nuclear processing plants or hazardous 
waste sites has generated a range of possible models (Cocco et al., 1995; Waller et al., 
1994). The analysis may focus on individual disease events and the pattern of their 
spatial locations in relation to one or more hazard sites, or on aggregate disease counts 
for small areas, again in relation to point sources. In the absence of individual or 
aggregate measures of exposure, the distance from the source is often used as a proxy 
for exposure. 

For testing clustering of events around a specified point source (i.e. a putative hazard 
site), a number of testing procedures have been suggested, but may have limited potential 
to model simultaneously features such as the impact of covariates, the presence of 
overdispersion, or generalised spatial dependence in the outcome beyond that associated 
with exposure to the hazard source. The basis of a more general parametric modelling 
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strategy in relation to environmental hazard sites builds on the approaches set out by 
authors such as Diggle (1990), Diggle and Rowlingson (1994) and Morris and Wakefield 
(2000). More general modelling approaches that offer a synthesis with the work of 
Besag et al. (1991) and others in relation to unfocused clustering appear in Lawson 
(1994), Lawson et al. (1999) and Wakefield and Morris (2001). 

Diggle (1990) suggested that the relative risk for disease events at location s in relation 
to a point source at so could be represented as 


Ms) = pg(s)a(d) 


where d = |s — so| is the distance between the location and the point source, p is the 
overall region-wide rate, g(s) is the population at risk at location s, and a(d) is a distance 
decay function expressing the postulated decline in exposure at greater distances from 
the source, and preferably also the (possibly elevated) relative risk of disease at the 
source as compared to the ‘background’ level. 

For instance, possible distance decay functions, expressing monotonic lessening in 
risk as distance from the point source increases, involve the additive form 


ad)=1 nf(d) 
where f() might be a simple exponential function 
f(d) = exp(— ad) 


or a squared exponential 


f(d) = exp(— ad’) 


Taking a > 0, these functions have the property that a() tends to 1 (the background 
risk) as d tends to infinity. Also, 1 n can be interpreted as the relative risk at or close’? 
to the source itself (where d is near 0). If in fact n = 0, or effectively so, there is no 
association between risk and distance from the source. Another type of model is ‘hot 
spot’ clustering, when there is uniformly elevated risk in a neighbourhood around the 
focus, but background risk elsewhere. Thus, for tracts at distances d < 8 from the focus, 
theriskis1 n, but for d > ò, the risk is either just set at 1, or maybe follows a distance 
decay from the threshold distance: 


ad)=1 nexp(—ald—8f) 


Models where direction 0, not just distance, from the point source to the case event or 
tract is relevant (e.g. when prevailing winds influence the potential spread of pollution) 
might include terms in sin 0 and cos 0. If there is a peak risk away from the source, then 
a term in log(d) as well as d can be included. Lawson and Williams (1994) suggest the 
additive form 


a(d, 0)=1 nf(d, 0)/d 


so that risk still tends to 1 as d — co, but f() may include aspects of direction. 


'S With certain decay functions f(d) may not be defined at d = 0, and a minimum possible distance of 
residence from source may be defined and used instead. 
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7.6.1 Areas vs. case events as data 


In area studies with observations consisting of disease counts Y;, the population risk in 
area i might be approximated by the expected disease total £;, based on the population 
age structure in that area, and d might be the distance between the area centroid and the 
point source. Additionally, to account for the more usual sources of overdispersion or 
spatial correlation, the model for the Poisson mean p; might include unstructured 
effects u; and spatial effects e; as in the preceding sections. If area characteristics x; 
(such as deprivation) also influence the disease risk, one has a model such as 


u; = pEjexp(Bx [1 f(d)lexp(u; ei) 
or 
log(u;)=logp log(E) Bx, log[l f(d)] ui ei 


Often, observations consist of a set of cases and their locations. On the assumption that 
individuals are independent (in spatial terms) with regard to risk of disease, possibly 
after allowing for relevant risk factors or confounders, one may, however, model the 
likelihood of case events i = 1,.. at locations (s;) in relation to a point source at so. 
The population density may be modelled via kernel methods, typically using aggregated 
administrative data for districts within the region. An alternative is to proxy the 
population distribution using a control disease unrelated to the risk from the point 
source (e.g. cardiovascular conditions may be assumed independent of residence near 
high voltage electricity lines, but certain cancers may not be). 

Let the nı disease cases be ‘successes’ in terms of a Bernoulli process, with y; = 1. So if 
the m control disease cases (with y; = 0) have locations s; i=n; 1,..m, m, the 
probability of caseness m; can be modelled in terms of the decay function, and possibly 
individual risk factors x; (which are confounders in terms of the main interest in the 
relation between risk and the pollution sources). Thus, 


yi ~ Bernoulli(a;) 

with 

Tmi/ — mi) = p*[l f(d exp (Bx) 
This conditional approach (Diggle and Rowlingson, 1994; Lawson, 2001, p. 44) has the 
advantage of not requiring an estimate of g(s). Note, however, that the interpretation of 
p* depends upon the selection of cases and controls; specifically, p* = (a/b)p, where a is 
the sampling proportion of cases (often 100%), b is the sampling proportion of controls, 
and p the population odds of disease (Diggle et al., 2000). 
7.6.2 Multiple sources 


Often there are multiple sources of environmental pollution, and assuming each site 
contributes to the risk independently of others one may take (Biggeri and Lagazio, 1999) 


u(s) = pg(s)a(d, ©) = pg(s) f X nsa, ao) 
k 


For instance, one option is 


u(s) = pg(s)a(d, ) = pg(5) i Y ne exp(— ard ] 
k 
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though often either or both of n, and a, are taken equal. Morris and Wakefield (2000) 
propose a slightly different formulation, with 


u(s) = pg(s)a(d, p) = pgs) ][ [1 ne exp (— ad] 
k 
Lawson (2001) discusses adaptation of the multiple source model to the case where both 
the geographic location of the cluster centres and their number are unknown. Other 
issues are relevant in multi-site analysis, for example, proximities between sites them- 
selves: risk from two sources relatively close to one another may overlap. 


Example 7.10 Leukaemia cases and hazardous waste sites A focused modelling 
analysis for area disease counts is illustrated by cases of leukaemia during 1978-82 in 
790 geographic tracts in upstate New York, as considered by Waller et al. (1994) in 
relation to 11 hazardous waste sites. The observed data in fact contain decimal parts 
because some cases could not be allocated with certainty to particular cells; the observed 
O; vary from 0 to 9.29. For illustration, the analysis is based on counts obtained as the 
integer parts of O; 0.5, so Y; = Int[O; 0.5]. 


A model including a distance decay function is essentially a departure from random- 
ness, where randomness implies that each person in the population has the same chance 
of a disease regardless of their place of residence (Waller et al., 1994). Under random- 
ness and without allowing for possible overdispersion, Y; ~ Poi(pP;), where P; is the 
population of the ith tract. Suppose, instead, the expected counts under randomness are 
multiplied by an additive or multiplicative function for relative risk as a function of 
distance from, or contiguity to, the focus. 

For instance the so-called clinal clustering model assumes that 


w; = pP: n fd] (7.28) 


where a(d;) is a distance decay function in terms of the distance of tract i from the 
source. To allow a ‘test’ of the null model 


Wi = pP; 


against the full model in Equation (7.28), one may introduce a regression selection 
indicator, H ~ Bern(0.5). Then, the likelihood involves the mean function 


m; =pP[l Hn f(d)] 


and the probability for the full model is the posterior probability that H = 1. 

Here this approach is first carried out separately for each of the 11 sites, using the 
function f(d) = exp(— ad). Note that this involves possible issues of adjusting for 
‘multiple tests’. This analysis employs the device of replicating the data 11 times and 
fitting a different Poisson model to each replicated data set, such that the mean for the 
kth analysis is based on distances from the kth site (Model A in Program 7.10). Since the 
sites are all waste disposal sites, and a single disease risk is under consideration, a single 
parameter a seems plausible to model the distance decay effect over the 11 sites (taking 
a, = a means that strictly the ‘tests’ are not separate). The excess risk n, is allowed to 
vary by site, and eleven Bernoulli variables Hy model the chance that n% is positive. So 
the mean for the Ath analysis is 


Wi =pPill Heng exp ( — adix)] 
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Informative G(1, 1) priors placing low weight on large values are adopted for a and ng. 
A relatively diffuse G(1, 0.001) prior is taken for p. A two chain run of 2000 iterations 
(with convergence at 250) yields probabilities on the full model in Equation (7.28) 
similar to those implied by the classical hypothesis test procedure of Waller et al. 
(1994, p.20). Specifically, the averages of the statistics 1 — Pr(H = 1) are intermediate 
between the significance test probabilities cited in Table 4 of Waller et al. (1994) for 
Stone’s test and Rao’s test (see also Lawson, 1993). The latter test is based on the sum of 
the deviations of the observed disease incidence in each tract from its expectation pP; 
under the null model, weighted by the exposure f(d ), where in Waller et al. (1994) and 
Waller and Poquette (1999) the latter is set as f(d) = 1/d. 

Thus, the probability that H = 1 exceeds 0.995 for the first two sites (Monarch 
Chemicals, IBM Endicott) and is around 0.98 for the third, fourth and seventh sites 
(Singer, Nesco, Smith Corona). The parameter a is around 0.09 with 95% interval 
{0.04, 0.14}. The results here are based on allowing a free parameter in the function 
f(d), and one could follow the Waller et al. approach more literally by setting f (d) = 1/d, 
though this would affect the estimates of the excess risk at the source (i.e. ofn = RR — 1). 

A global test of risk from one or more waste sites is implemented via the multi-site 
model 


w; =pP;|1 HS) 1, exp(—ad) 
k 


with H ~ Beta(0.5). The probability that the Bernoulli indicator H is 1 is equivalent to 
the probability that there is risk from one or more sites. In fact!®, there is a negligible 
probability that Pr(H = 0), and so clear evidence of risk from one or more sites. The 
estimate of the distance decay parameter is higher under this model, at around 0.30, and 
the n, vary from 1.88 (site 7) to 0.27 (site 10). Some relative risks are clearly more 
elevated than others, with 2.5% points for n, away from zero in the first, fifth and 
seventh sites (Table 7.7). 

The Poisson mean might be altered to a form with site specific selection indicators Hy 


u; = pP;} 1 So Ain exp ( — ad) 
k 


to account for differential risks by site. Applying this approach’? in Model C in 
Program 7.10, it appears that the probabilities Pr(H;, = 1) are highest for sites 1, 5 
and 7 (the probabilities are between 0.95 and 0.97), though not for site 2 (IBM Endicott) 
as in the ‘independent tests’ model A. The close proximity of sites 1 and 2 may be noted, 
and this may affect its estimated risk in a multi-site model. 

Possible extensions of these forms of model are to add unstructured or spatially struc- 
tured errors for the 790 tracts, or if they were available, to add tract covariates (e.g. income). 


Example 7.11 Larynx cancer in Lancashire An example of an individual level out- 
come, we consider event case data consisting of 57 larynx cancer cases that occurred in a 
part of Lancashire over 1974-83. The controls are provided by 917 lung cancer cases in 
the same study region. The point source implicated in the disease is a waste incinerator 


'6 As judged from a two chain run of 2000 iterations, and convergence by 250 iterations. 
17 Estimates of n; in this model should only be based on the iterations where Gy = 1, since when Gy = 0 nọ is 
just a sample from the prior. 
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Table 7.7 Estimates of distance and site risk parameters; 
Global risk model 


Mean St. devn. 2.5% Median 97.5% 


a 0.302 0.102 0.146 0.289 0.546 
nı 1.478 0.814 0.209 1.361 3.404 
n2 0.543 0.415 0.023 0.455 1.541 
N3 0.714 0.598 0.024 0.576 2.198 
N4 0.754 0.688 0.020 0.565 2.547 


Ns 1.168 0.507 0.321 1.123 2215 
Ne 0.907 0.888 0.021 0.636 3.380 
n7 1.884 0.854 0.396 1.822 3.786 


Ng 0.977 0.846 0.035 0.758 3.211 
No 0.796 0.771 0.027 0.571 2.797 
Nio 0.273 0.234 0.010 0.213 0.855 
Ni 0.866 0.772 0.026 0.655 2.963 


at location (355000, 414000). Although frequently analysed, this data set has a relatively 
small case total, and firm identification of exposure effects in terms of (say) distance 
decay may be difficult to obtain. 

As above, the intensity model is 


Ms, d) = p*g(s{1 f(D} 


or with individual confounders x; 
A(s, d) = p* exp (Bx) fd] 


but in the conditional logistic model (Diggle and Rowlingson, 1994), the population 
intensity is ‘conditioned out’. In the present example, there are no confounders. So if 
Y; = 1 for larynx cancers and Y; = 0 for lung cancers, the simplest sampling model is 
Bernoulli, with 


Y; ~ Bernoulli(z;) 
wi/1—7i)=p*[l f(a] 


Here terms in d and log(d) are introduced in f() to allow for peaked exposure and 
incidence away from the source. Additionally, directional measures are introduced, 
namely terms in cos(8) and sin (0), where 0 is the angle (in radians) between source 
and case/control event. For instance an event due North of the source would have 
angle 90° and radian value 0 = 1.57, while an event due South of the source would 
have angle and radian of 270° and 4.71, respectively. 

An initial ‘variable selection’ stage involves setting binary selection indices G; to G4 
on the coefficients a; to a4 in the function 


a(d,9)=[l exp{—aid a log(d) a3cos(®) agsin(6)}/d] 


where a, az are both taken as positive (Lawson and Williams, 1994) and N(0, 1) priors 
constrained to positivity are adopted. For a3 and a4 N(0, 10) priors are assumed. 
A single chain of 10000 iterations show selection rates slightly above 0.5 for a; and 
a3 by comparison to prior rates of 0.5 under the priors G; ~ Bern(0.5), whereas az 
and a4 have selection rates below 0.5. So there is (rather weak) evidence that these are 
potentially more important predictors of spatial variations in risk and incidence. 
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Investigations show that fitting a model with both cos (0) and linear distance is not 
well identified. Predictive methods (e.g. Gelfand and Ghosh, 1998) are then used to 
assess the alternative single predictor models 


a(d, ®)=[1 ng exp {B cos (6)} /d] 
a(d, 0) = [1 Na EXP { aE ad)}/d] 


Note that the first model has a (preset) distance decay effect by default. In fact, this 
model identifies a reasonably well defined negative effect of B, or raised risk in the NW 
and SW quadrants, after taking account of the preset inverse distance effect. The 
parameter ng is higher than ną, with posterior median of 0.56 as against 0.22 for ną- 
However, the models are comparable in terms of predictive fit, both around 82.5. 

Figure 7.3 is a kernel plot of the relative risk effect under Equation (7.29), namely 
Ng = RR- 1, and shows it to be distinct from the zero null value, so apparently 
identifying a real effect. Figure 7.4 shows the posterior density of the directional 
parameter. 


(7.29) 


7.7 SPATIO-TEMPORAL MODELS 


The above applications have considered spatial outcomes without reference to other 
dimensions, but the evolution over time of dependent variables for spatial outcomes has 
relevance in several contexts. For instance, tests for clustering may be extended to 
include both time and space (e.g. Rogerson, 2001), and prediction from panel data of 
area data may be extended to include spatial correlation (Baltagi and Dong, 2000). If 
age and cohort are included with time, then a broad class of spatial Age-Period-Cohort 
(APC) models may be used in modelling and forecasting mortality or disease incidence 
by area (Schmid and Knorr-Held, 2001). The focus here only illustrates the potential for 
Bayesian modelling, and considers especially ecological regression through time, and 
spatio-temporal models for innovation. 


+ 
9-9 
9-8 


Density 
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Figure 7.3 Plot of distance decay parameter 
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Figure 7.4 Kernel plot of directional parameter 


In terms of ecological regression, Carlin and Louis (1996) discuss analysis of counts 
of deaths or new disease cases by area i and demographic subgroups j (age, sex, 
ethnicity) which are recorded at regular intervals ¢. This would typically generate 
binomial or Poisson outcomes in relation to populations at risk Pj;. Thus, possible 
specifications are 


Yin ~ Bin(rijr, Pit) 
and 
Yi od Poi(E jr Pj) 


with the expected cases E; based on applying standard rates (e.g. national rates) to the 
area populations P;;. The model for the Binomial rates rj, or relative risks ,,, and 
hence the priors on the parameters of that model, would be specified in terms of two 
major criteria, namely the desire to stabilise rate estimates by smoothing the crude 
estimates, and the need for the model to reflect both spatial and temporal dependence in 
the observations or regression errors. 

Consider a model with just areas and time, and with the impacts of different demo- 
graphic structures controlled for by standardisation. Thus, 


Yy ~ Poisson(Ejr;,) 


and a simple model for the log relative risk Z; = log (;,) would be as a sum of constant 
‘unstructured’ effects u;, spatially dependent effects e; for instance with an ICAR(p) 
form as in Equation (7.12), and time parameters 6,. 

For a short panel of observations with an ICAR(p) model for e; and with 


Zr=w Ui e 9; 


the 5, may be best modelled as fixed effects, with 6; = 0 for identifiability. Alternatively, 
the intercept p can be omitted, and the 6, as free parameters will then detect any 
trend. 
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If an ICAR(1) form is used for the e;, then it will be necessary to assume both p = 0 
and 6; =0 unless the sampled values of e; are centred at each iteration; further 
questions of identifiability are considered by Waller et al. (1997). To allow for unstruc- 
tured space-time interactions W,,, one might extend the relative risk model (assuming an 
ICAR(p) model for e;,) to 


Z= u ei Vi 


where the y, are distributed about time specific means 6,. 


7.7.1 Space-time interaction effects 


In practice, more complex issues occur as discussed, for example, by Knorr-Held and 
Besag (1998), Gelfand et al. (1998) and Sun et al. (2000). For example, it may be that the 
relative balance of unstructured random variation as against spatial clustering is 
changing over time. This might suggests a model with clustering differentiated by 
time, such as 


Zit =e Ui Nit 
where the spatially correlated risk ny has a time specific variance k? (Carlin and 


Louis, 1996). Modelling options for spatially structured area-time interactions nj 
are very wide, since autocorrelations over areas may be combined with those 


over time. For example, defining N; = (Nia, No. ----Nn) and a white noise errors 
v; = (vi, Vi2, ----Vnt), then instead of a simple cross-sectional spatial correlation 
n=AWnH, vi 


one may have one or more time lagged spatial correlation effects, such as 
N=MWy, MWe vi 


Lagazio et al. (2001) assume spatial effects are described by contiguity and describe a 
‘random walk’ scheme in space and time, involving smoothing forward as well as back 
in time, whereby 


Ny ~ NOSIM 1 Ni eal Ne? Soni = Son, t-1 Soniye 1}, «°/[2Ni)) 
j~i jmi a 


where j ~ i denotes those N; areas of i which are its first order neighbours. 


7.7.2 Area level trends 


Another issue centres on modeling trends in the outcome (e.g. mortality) relative, say, to 
national levels. Such trends may be different between areas. A linear trend, uniform 
across all areas, can be included in a model such as 

Zi =e Ye Ui ej 


To allow for differentiated trends between areas with some falling more than the 
national or regional trend, some much less, one might specify a random growth rate. 
Thus following Bernardinelli et al. (1995), 


Zin =e YÍ Ui ei (7.30) 
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where y; may be taken as unstructured growth rates with overall average growth rate y. 
However, if trends are expected to be differentiated in a spatially distinct pattern (e.g. if 
the largest relative falls in the outcome are spatially clustered), then the y; might be 
assumed to be spatially dependent, for example with ICAR(p) or ICAR(1) form and 
with variance o2 (see also Sun et al., 2000). 

Temporally persistent differences in the outcome may be important. For example, 
persistent mortality differences across demographic, cultural, regional and groups have 
been noted by authors such as Kerkhof and Kunst (1994). Even when there is some 
shuffling of spatial relativities in the outcome over time, one may nevertheless ask how 
persistent differentials across areas are, for example, via introducing a temporal auto- 
correlation in the risks. This may be expressed in a panel model such as 


Zin =M Wy Ui ei 
Wire = PPr- Vir 

for time periods ¢ > 1, where the v; are white noise errors, and 
Zi=we Vin ui ei 


for the first period. For stationarity, the temporal correlation parameter p is constrained 
to be between —1 and 1. The h;, for t > 1 have mean pis,,_, and precision t,. The , 
are assumed to follow a distinct prior with mean 0 and precision 7). 

Differences in persistence between areas under either model could be handled by 
making persistence an area specific random effect p;, either spatially structured or 


unstructured. Thus, 
Zi = i U & 
=e Wi (7.31) 
Wir = Pi Pii Vir 


for time periods ¢ > 1, and 
Za=Ņp Vn Ui êi 


for t = 1. If the p, are assumed spatially uncorrelated, then their prior density could be 
of the form p; ~ N(0, r) where o? represents variations in persistence. 


7.7.3 Predictor effects in spatio-temporal models 


Finally, variability in relative risks over both space and time may be caused by changing 
impacts of social and other risk variables. Trends in the impact of a time-specific 
predictor X; may be modelled via 


Zit =B Xr M u ej 


with B, either fixed or random effect (e.g. modelled by a first order random walk). 
A model with both area and time dimensions also allows one to model differences in the 
importance of explanatory variates between areas, for instance via a model such as 
(Hsiao and Tahmiscioglu, 1997) 


Zi =BXv Òr Ui ei 


and this may be achieved without recourse to the special methods of Section 7.5. 
However, models with regression coefficients which are both time and spatially varying 
as in 
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Zit = Bi Xit Ò Uui ĉi 


would suggest using GWR or multivariate prior methods, as discussed earlier in 
Section 7.5. 


7.7.4 Diffusion processes 


Behavioural considerations may also influence the model form. In both medical geog- 
raphy and regional economics, diffusion models have been developed to describe the 
nature of spread of new cases of disease, or of new cultures or behaviours in terms of 
proximity to existing disease cases or cultural patterns. Here the model seeks to describe 
a process of contagion or imitation, namely of dependence among neighbouring values 
of the outcome, and error terms may possibly be taken as independent, once the 
autodependence in the outcomes is modelled satisfactorily. To cite Dubin (1995) in 
the case of adoption of innovations by firms, ‘just as a disease spreads by contact with 
infected individuals, an innovation becomes adopted as more firms become familiar 
with it, by observing prior adopters’. If the outcome is absorbing, or at least considered 
irreversible for modelling purposes, then the process of diffusion continues until all 
potential subjects exhibit the behaviour concerned, and the process approximates 
S-shaped or logistic diffusion. 

Dubin considers a dynamic logit model of diffusion to describe the adoption of 
innovations by firms or entrepreneurs. The observed outcome is binary Yp, equalling 
1 if the innovation is adopted by firm i at time t, and Yy = 0 otherwise. Note that it is 
not possible for Yi,; p (k > 0) to be zero of Yj; is 1. This outcome is generated by an 
underlying utility or profit level Y;,*, which is a function of the firm’s characteristics and 
its distance from earlier adopters: 


Yie* = XnB 5 Pi Y, t-1 Ui (7.32a) 
J 


where Y;;* is the expected profit from the innovation and the u; are without spatial or 
time dependence. The influence p,; of prior adopters is a function of inter-firm distance 
dj, such that 


Pi = &1 Exp (— ardjj) (7.32b) 


with p; = 0. Here a; expresses the impact on the chances of adopting an innovation of 
the presence of adjacent or nearby adopters (with dj small), and a2 > 0 expresses the 
attenuation of this impact with distance. 


Example 7.12 Adoption of innovations In a similar way to Dubin (1995), we simulate 
data over T = 5 periods for 50 firms using a grid of 100 x 100 km, so that the maximum 
possible distance between firms is 140 km. The first period data are generated with 


Yi* =-3 2X1; om 3x9; (7.33) 


where x; and x2 are standard normal. Grid references (s1;, s2;) are randomly selected 
from the interval (0, 100). 15 out of the 50 firms are adopters at time | (i.e. have positive 
Y; *) using this method. At the second stage the observations Yj, at times ¢ > 1, are 
generated according to the ‘profits model’ in Equation (7.32), taking a; = 0.5 and 
az = 0.02 and distances defined between pairs (51;, 52;). In this example, if it is known 
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that Yı = 0 then Yo = 0 necessarily, so there are partly observed ‘initial conditions’. For 
firms with Y; = 1, one may simulate Y;ọ using an additional parameter mo. 

Then with the 50 x 5 simulated data points, the goal is to re-estimate the generating 
parameters. A beta prior B(1, 1) is adopted for mo, and N(0, 1) priors on the parameters 
a; of the influence function p; (with only positive values allowed). Non-informative 
priors are taken on the parameters of B. On this basis, a two chain run of 2500 iterations 
(and convergence from 250) shows the B parameters are re-estimated in such a way as to 
correspond to those in (7.33), and the distance decay parameter a is also closely 
reproduced (Table 7.8). 

Especially in processes where transitions from | to 0 as well as from 0 to 1 are 
possible, one might also consider space and time decay dependence via functions such as 


Pit = 1 exp ( a ard = azt) 


with the sign and size of a3 reflecting the path of the process over time. 

In the present application (despite knowing that the mode of data generation assumes 
az fixed over time), one may more generally envisage distance decay varying over time, 
since the balance between adopters and non-adopters, and hence the spatial distribution 
of the two categories, changes through time. Therefore, an alternative model here takes 
a2 to be time varying with 


Yir* = XirB 5 Pijt Y; -1 Ui 
J 


Pije = &1 EXP (— azdi) 


In fact the predictive loss criterion (footnote 5) with w = 1 shows the constant decay 
effect model to be slightly preferred with D.z at 22.1 compared to 23.6 under time 
varying decay. The coefficients œv; are much higher in the first two periods'®, with 
means 0.8 and 0.33 (though both parameters have skewed densities), as compared to 
means for the last three periods of 0.041, 0.035 and 0.031, respectively. The a; coeffi- 
cient is elevated to 0.9. A suggested exercise is to fit the model with both distance 
function parameters time varying: 


Pijit = O11 EXP (— azdi) 
Example 7.13 Changing suicide patterns in the London boroughs This analysis con- 


siders spatio-temporal models in a disease mapping application with event count data. 
Specifically, the focus is on trends in period specific total suicide mortality (i.e. for all 


Table 7.8 Profits model for spatially driven innovation 


Mean St. devn. 2.50% Median 97.50% 


ar 0.58 0.20 0.25 0.55 1.05 
az 0.019 0.009 0.003 0.019 0.037 
Bi —3.10 0.61 —4.37 —3.07 —1.99 
B2 1.94 0.35 1.30 1.93 2.65 
B3 —3.28 0.53 —4.37 —3.26 —2.32 


To 0.032 0.030 0.002 0.026 0.103 


I8 Convergence occurs later in this model, at around 500 iterations in a two chain run of 2500 iterations. 
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ages and for males and females combined) for the 33 boroughs in London and eight 
periods of two years each between 1979 and 1994. Expected deaths are based on 
national (England and Wales) age-sex specific death rates for each period. There is 
evidence for changing suicide mortality in London over these periods relative to 
national levels, and for shuffling of relativities within London. Thus, London’s suicide 
rate seemed to be falling against national levels with an especially sharp fall in central 
London areas. 


Here two models among those discussed by Congdon (2001) are considered, namely 
model (7.30), allowing for differential trends to be spatially structured, and a differential 
persistence model as in Equation (7.31) with unstructured borough persistence effects p;,. 
The latter model was estimated subject to the assumption of stationarity in the individ- 
ual borough autocorrelations. Fit is assessed by the pseudo marginal likelihood and 
via the Expected Predictive Deviance (EPD) criterion. This is based on ‘predicting’ new 
or replicate data from the posterior parameters. The better the new data match the 
existing data, the better the fit is judged to be (Carlin and Louis, 1996, Section 6.4). 
Specifically, Poisson counts Yi;new are sampled from the posterior distribution defined 
by the predicted Poisson means and compared with the observed Y; via the usual 
Poisson deviance measure. As for the usual deviance, the EPD is lower for better fitting 
models. 

Estimation of the differential growth model in (7.30) using a two chain run of 5000 
iterations shows an average decline of 0.02 in the relative suicide risk in London 
boroughs in each period (measured by G <- mean(s[]) in Model A in Program 7.13). 
The steepest decline, at 0.054 per period, is in the central London borough of Westmin- 
ster (area 33). 

For the differential persistence model, the posterior estimate of o, at around 0.65 
(with 95% interval from 0.5 to 0.85) suggests some variation in the extent to which 
suicide risk within boroughs is temporally autocorrelated. The average correlation is 
around 0.44, but it varies in individual boroughs from —0.15 to 0.92. The highest 
continuity was in low suicide suburban boroughs which tended to remain at low levels 
throughout the span of the study, and in central London boroughs where suicide 
remained consistently high, though the excess above the overall average was falling 
through time. 

In selecting between the linear growth model and the persistence models, there is 
some conflict between the model choice measures. Simple fit (deviance and EPD) 
unequivocally favours the latter, with the mean EPD at 525 compared to 584 for the 
differential growth model. However, the marginal likelihood based on totaling the CPO 
criterion over all periods and areas shows the differential growth model with a marginal 
likelihood around —946 as against —963 for the spatial persistence model. A suggested 
exercise is to assess the effective parameters in both models and use the DIC criterion as 
a further method of selection. Further model options are to assume spatially correlated 
persistence, and to relax the stationarity assumption on borough autocorrelations. 


7.8 REVIEW 


The chapter has sketched some of the major spatial applications where Bayesian ideas 
and modelling principles have been important and beneficial, and the growing role of 
Bayesian spatial modelling is closely linked to improved and simplified estimation via 
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MCMC methods. It seems inevitable that Bayesian approaches will play a central role in 
future developments, in areas such as 


(a) direct modelling of spatial covariance matrices, drawing on geostatistical techniques 
but applying them in new areas (e.g. disease mapping); 

(b) spatial models that allow for discontinuities or outliers, developing on the work of 
authors such as Lawson and Clark (2002) and Lesage (2000); 

(c) models that allow for spatially varying impacts of covariates, including single 
equation regression methods or extensions of multi-regression GWR approaches; 

(d) spatial model assessment based on cross validatory principles (Congdon, 2002b); 

(e) focused cluster models where the outcome is ordinal (e.g. severity of disease) or 
multivariate; 

(f) spatial and space-time models in multivariate outcomes. 
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7.9 EXERCISES 


1. In Example 7.1, fit the spatial errors model using the weights based on distan- 
ce/shared boundary rather than contiguity. How does fit (e.g. via the pseudo 
marginal likelihood based on CPOs) compare to that cited previously? 


2. In Example 7.2, fit the spatial autoregression model (7.5) under both Student ¢ and 
homoscedastic Normal errors u, and assess inferences in terms of the impact of 
outlier areas. 


3. In Example 7.3, compare the fit of the four models considered using the deviance 
information criterion of Spieglehalter et al. (1998). 
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4. In Example 7.3, fit the ICAR(1) model including unstructured as well as spatial 
errors and under the options with and without the AFF variable. How does the ratio 
of marginal variances (spatial to unstructured variances) change? 


5. Fit the Leyland et al. spatial model (see Example 7.4) to the lip cancer data of 
Example 7.3. How does this affect the coefficient for the occupation variable? 


6. The following are 400 microwave backscatter measures in a large field in East 
Anglia, UK, on a 20 x 20 grid with 12m intervals between points: 


22 39 38 67 46 1 46 73 95 51 92 68 33 64 77 73 82 54 61 97 

43 36 58 32 35 35 88 56 73 64 74 62 55 73 105 80 90 21 57 100 

45 15 34 28 35 89 113 45 54 44 49 59 94 58 85 112 77 83 48 75 

48 14 14 83 55 81 73 33 58 74 81 78 66 96 69 102 94 86 96 80 

18 24 51 88 65 41 47 20 46 94 76 97 73 94 41 95 100 62 101 75 

66 53 40 50 29 19 67 50 57 85 57 88 69 106 88 79 98 69 82 95 

113 50 24 11 50 64 90 80 63 68 52 68 79 90 74 40 73 87 100 86 

103 31 45 29 56 115 132 78 82 54 70 51 40 69 80 71 88 95 94 68 

83 70 104 81 85 120 186 87 85 86 34 30 49 61 103 99 94 87 90 104 
72 78 186 132 111 153 157 119 90 82 80 95 79 57 79 57 68 105 63 99 
82 164 157 138 136 155 157 109 90 104 115 101 86 36 98 66 57 73 94 97 
107 166 157 101 101 93 157 85 82 80 111 82 77 104 67 70 79 57 76 81 
105 87 103 64 58 77 97 87 81 65 85 83 44 97 67 98 99 50 83 81 

145 116 91 94 34 49 73 64 46 83 60 69 90 89 63 89 84 67 47 86 

135 100 115 100 46 44 81 51 27 81 81 62 95 96 67 82 77 68 79 80 

78 127 115 90 47 92 103 79 58 95 54 70 93 105 68 100 39 65 98 96 

76 140 100 95 63 78 98 83 66 56 79 71 92 85 87 92 97 74 102 78 

82 119 69 83 78 40 73 81 77 93 61 63 78 62 87 61 76 75 98 23 

71 53 58 79 64 83 66 64 94 89 24 72 90 68 55 82 92 81 63 64 

86 63 78 84 67 78 63 71 91 60 80 54 80 98 65 80 99 61 73 37 


Group the 400.399/2 pairs of points into bands according to separations 0-12,12-24, 

. etc. up to a maximum distance band, and derive the necessary statistics for a 
variogram (this may be easier done outside BUGS). Then fit exponential and 
spherical variograms to the series, as in Example 7.5. 


6. In Example 7.7, compare the fit of the two models using the DIC criterion. How does 
this compare to the Gelfand-Ghosh criterion? Employing both criteria, fit a model 
where aj is a power function, namely ay = d;“. 


7. In Example 7.8, add code to find which areas have the maximum coefficients under 
the GWR model. How does this add to knowledge about possible outlier areas? Also, 
consider the effect of taking Student ¢ errors with known degrees of freedom (e.g. 
v= 4). 


8. In Example 7.9, combine the heteroscedastic error model with an ICAR(p) spatial 
error. How does this affect the AFF coefficient and the fit via DIC? 


9. In Example 7.11, fit the three parameter decay and direction function 


f(d, 0@)=[1 n exp {P cos (8)}/d"] 


with a positive. 
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10. In Example 7.12, fit the time varying spatial effect 


Pij = a1: EXP ( — azdi) 
and assess fit against the constant parameter model 
P; = a exp ( — andj) 


11. Fit the spatio-temporal model for London borough suicides of Example 7.13 
assuming spatially correlated borough specific persistence (temporal correlation) 


parameters. 
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CHAPTER 8 


Structural Equation and Latent 
Variable Models 


8.1 INTRODUCTION 


Structural equation modelling describes multiple equation models that include latent or 
unmeasured variables (‘factors’ or ‘constructs’) for which multiple observed indicators, 
continuous or discrete are available (Bollen, 1998), and that allow for measurement 
error in the latent constructs. They have found a major application in areas such as 
psychology, education, marketing and sociology where underlying constructs (depres- 
sion, product appeal, teacher style, anomie, authoritarianism, etc.) are not possible to 
measure directly. 

Instead, the observed data on a large number of indicators are used to define (i.e. 
serve as proxies for) the underlying constructs, which are required to be fewer in number 
than the observed variables to ensure identifiability (Bartholomew, 1984). Depending 
on the application the observed variables may be known as items (e.g. in psychometric 
tests), as indicators or as manifest variables. One might assume that conditional on the 
constructs, these observed indicators are independent — so that the constructs account 
for the observed correlations between the indicators (this is often called the ‘local 
independence’ property). However, as Bollen (2002) points out, this is not an intrinsic 
feature of structural equation models. 

Structural equation models include both measurement models, or confirmatory 
factor analysis models, confined to representing the constructs as functions of the 
indicators, and fully simultaneous models allowing interdependence between the con- 
structs. The canonical structural equation model takes the LISREL form (Joreskog, 
1973), with a model relating endogenous constructs W to each other and to exogenous 
constructs € and a measurement error model linking observed indicators Y and X to the 
latent variables. Thus for the ith subject, the structural model is 


y=A By GE wi (8.1a) 


where ; is a p x 1 vector of endogenous constructs, €; is a q x 1 vector of exogenous 
constructs, w; is a p x 1 vector of errors on the endogenous constructs, B is a p x p 
parameter matrix describing interrelations between the endogenous constructs |, G is a 
p x q parameter matrix describing the effect of exogenous on endogenous constructs, 
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and A is a p x | intercept. In any particular model, there may be no € variables, and 
some of the B and G coefficients are typically assigned zero or other default values. The 
links between observed indicators and the constructs are defined by the measurement 
model or models: 


Y;=k;, Ayp; uy (8.1b) 


Xi=ks Aë uX (8.10) 


where Y; isan M x 1 vector of indicators describing the endogenous construct vector W, 
and X; is an L x 1 vector of indicators that proxy the exogenous construct vector &; Ky 
and kx are M x l and L x | intercepts; and A, and A, are M x p and L x q matrices of 
loading coefficients describing how the indicators determine the scores of an individual 
on the latent constructs. Typically, restrictions are applied on some loadings to ensure 
identifiability. A particular analysis may involve just a measurement model, in which 
case the distinction between different types of indicator is not relevant, and the meas- 
urement model reduces to 


Y=“ Ab, u (8.14) 


The focus in structural equation models is on confirming the nature of the underlying 
constructs as postulated by substantive theory, or on testing causal hypotheses based on 
theory, rather than on exploratory data analysis. Thus, confirmatory factor analysis, a 
particular from of structural equation model, differs from exploratory factor analysis in 
postulating a restricted loading structure in which only certain loadings are free to be 
estimated (see Section 8.2). The nature of the linkages from the latent variables to the 
observed indicators is often defined on a priori theoretical grounds, so the ‘prior’ of a 
Bayes model for a confirmatory factor analysis may have a large subject matter element 
and is to that extent clearly ‘informative’. 

As an example of this type of model and of the identifiability constraints on the 
parameters, suppose five indicators Y,,.., Ys are taken to be measures of two con- 
structs , and ,. Further, suppose indicators Y1, Y2, Y3 (e.g. measures of spatial 
ability) have loadings \1, 42 and à; on construct w, while indicators Y4 and Ys (e.g. 
measures of linguistic skills) have loadings \4, N5 on ,. However, there are no loadings 
of Yı, Y) and Y; on W, or of Y4 and Ys on y. This is a hypothesized structure and one 
open to assessment: it may be that some indicators in fact show significant ‘cross- 
loading’, with Yı showing a non-zero link to , for example. It is also open to question 
whether , and s, (spatial and linguistic ability) are correlated or orthogonal. 

Since , and W, have arbitrary location and scale, one option to gain identifiability is 
to define them to be in standard form, with zero means and variances of unity (Bentler 
and Weeks, 1980). If correlation between the two constructs is allowed, then they might 
be taken to be bivariate normal with variances 1 in the diagonal of the covariance 
matrix, but with an off-diagonal correlation parameter. Note, though, that there is no 
necessary restriction to assuming the constructs in a structural equation model to be 
Normally distributed. 

Under this predefined scale option the loadings ^1, ^2 and à; relating Yı, Y) and Y3 
to 41, and hy and As relating the verbal test scores to 5, are all then free parameters. So 
letting |; = (Yi, Y2), one might have 


w~m([o} l 4) 
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Yi =k, MỌ u 
Yo, =K2 Mpi Mhi 
Yi =K; MỌ Ui 
Yy = K4 MỌ; Ui 
Ys; = K5 Aspo Usi 


Since the constructs are intended to summarise information in the indicators relatively 
informative priors, e.g. N(1, 1) or N(0, 1), may be used for the loadings; for example, see 
Johnson and Albert (1999) in the context of item analysis. Often a scaling of the 
observed indicators (e.g. centred or standardised X and Y) is useful also in identifica- 
tion. Note that p is identifiable in the above example, because all cross loadings have 
been set at zero; for example, there are no loadings of Yi, Y2 and Y3 on Wy. 

In general, if the ,,; are taken to be uncorrelated and to have variance 1, and all 
M.p = 10 possible loadings are of interest, then p(p — 1)/2 = 1 restrictions are re- 
quired for identifiability (e.g. Everitt, 1984). If \,,; is the loading on the mth indicator 
on the jth factor, then one might for example impose zero, unity or equality constraints 
(for the above these might be \s; = 0 or Ay; = ^42). Note that a Bayesian approach may 
circumvent the need for exact constraints (see below). 

An alternative parameterisation fixes the scale of the constructs by selecting one 
loading corresponding to each factor — here one among the loadings {1, M2, 43} and 
one among {4, As} — and setting them to a predetermined non-zero value, usually 1. 
This is similar to the corner constraint used for categorical predictors in log-linear 
regression, except that it takes a multiplicative rather than additive form. In the example 
considered here, the variances of y; and W, are then free parameters related to those of 
certain observed indicators. Suppose \; = \4 = 1, so that 


Y =K] Pi Ui 
Yo; =k Mpi Mi 
Yzi= K3 A3hy; Ui 
Y4i = K4 Wo; Udi 
Ysi= K5 Aspo; Usi 


with the & again being bivariate normal with zero means but all three parameters in the 
dispersion matrix free. 

It may be noted that fixing a loading }, to, say, 1 has utility in preventing ‘relabelling’ 
of the construct scores y; during MCMC sampling. Since Y;— Y; in this example are 
positive measures of spatial ability setting ^; = 1 means the construct , will be a 
positive measure of this ability. If, however, one adopted the predetermined variance 
identifiability constraint such as ys, ~ N(0, 1) where all the à; are free, it may be 
necessary, to prevent label switching, to set a prior on one or possibly more loadings 
constrained to positivity, e.g. 


II 


Mt ~ NC, 1) Z(0, ) 


8.1.1 Extensions to other applications 


This framework may be extended to panel data where multiple items Y;,,, are observed 
for subjects, variables m = 1,.., M and over times ¢ = 1,.., T; here the constructs may 
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also be modelled as time varying (Section 8.4). Repetitions may also occur over groups g 
such that the data have the form Ying, where g is the group, m the indicator, and i the 
subject (Section 8.2). Latent construct models with ; continuous may equally be 
applied to explain multiple discrete indicators, for example collections of binary tests 
or items (Bartholomew and Knott, 1999). One may also consider latent constructs 
which are discrete, and so connections to discrete mixtures (Chapter 2) and discrete 
mixture regressions (Chapter 3) become apparent. Latent class analysis (Section 8.3) is 
often applied in psychology and attitude research, and it may be that the latent 
construct in a particular application can either be modelled as discrete or as a continu- 
ous ‘trait’, and the choice between these options is made on grounds of statistical fit 
(Langeheine and Rost, 1988). 


8.1.2 Benefits of Bayesian approach 


Classical structural equation methods for continuous indicators Y that are based on 
multivariate Normality of the indicators usually involve minimising a discrepancy 
between the observed covariance and the predicted covariance under the model — 
since under multivariate Normality the covariance matrix is sufficient for the data, 
and in particular the associations between indicators. This is the default approach 
in computer packages such as LISREL or EQS, with the input consisting of M x M 
covariance matrices rather than the full nx M array of observations Yim. Browne 
(1984) generalises the multivariate Normal theory to any multivariate distribution 
for continuous variables, so permitting analysis of covariance structures only, provid- 
ing non-Normality (e.g. kurtosis) is not pronounced. It is, however, becoming 
more common to analyse raw data so that appropriate techniques allowing for non- 
Normality (e.g. excessive kurtosis) or potential outliers can be applied. This may also be 
a more flexible option if the indicators contain a mix of discrete and continuous 
measures. 

Considerations of robustness, and other questions such as the ease with which 
parameter restrictions may be imposed and predictions made for new cases, may 
point to a Bayes approach which retains the full observation set as input — this indeed 
is easier to implement in WINBUGS. Comparisons between different models might 
involve variations on the measurement model, the structural model, or both. A Bayesian 
perspective may have advantages here if hypotheses are not be restricted to nested 
alternatives (Fornell and Rust, 1989). 

Scheines et al. (1999) discuss how a Bayes approach may assist in estimating what are 
conventionally described as unidentifiable models, for instance regressions in which all 
predictors are measured with error and the strategy of instrumental variable estimation 
is not feasible. Lee (1992) describes methods to restate the usual classical identifiability 
restrictions as stochastic. For instance, one might assume both a free dispersion matrix 
for factors p; and Yp and free loadings \1,..., Ns, but informative priors on à and \4 
would be taken based on the classical constraint; for instance, taking 


Ay ~ NC, 0.1) 
hg ~ NCI, 0.1) 
allows limited movement around the usual constraints. If all ten loadings were of 


interest and an identity dispersion matrix assumed for ,;, j = 1, 2, then an identifia- 
bility constraint such as 11 = ^42 might be expressed as 
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Ai =QAqQQ ee 


where a is a parameter centred at 1 and e is a low variance error. 

More generally, the Bayes method has potential advantages over maximum likeli- 
hood estimates for describing the densities of the parameters of structural equation 
models. The sampling distribution of parameters in such models is only asymptotically 
Normal (Bollen, 1989), so that maximum likelihood standard errors calculated assum- 
ing the distribution of the estimator is Normal may be distorted for small sample sizes 
(Boomsma, 1983). Bootstrap techniques are one option to overcome this problem 
(Bollen, 1996). However, in the Bayesian approach MCMC samples are taken from 
the true posterior regardless of sample size, and so standard errors calculated 
from MCMC output are more reliable for small samples, or when there are other 
sources of non-normality. 

On the other hand, a maximum likelihood solution generally converges rapidly to a 
solution with a clear labelling of constructs. As mentioned above, precautions (in the 
form of constrained priors on loadings, say) may need to be taken in a Bayesian MCMC 
sampling context to ensure unique labelling of constructs, and prevent label-switching 
during sampling. 


8.2 CONFIRMATORY FACTOR ANALYSIS WITH A SINGLE GROUP 


The advantages of confirmatory models with prior information from a subject matter 
base as compared to exploratory factor analysis (where all loadings are free to be 
estimated) have been argued by Hertzog (1989), in terms of the interpretative ambiguity 
of exploratory analysis. Thus, using an orthogonal rotation or oblique rotation in an 
exploratory factor analysis will give different answers as to factor structures, and 
various often arbitrary methodological choices in oblique rotation will lead to different 
answers about the correlations between the latent constructs. As well as having a clear 
and unambiguous factor structure, which permits evaluation of alternative hypotheses 
(e.g. correlated or orthogonal factors), confirmatory factor analysis permits assessment 
of the construct validity of measurement scales. 

An important question in confirmatory factor analysis is the stability or ‘invariance’ 
of structural relationships and measurement models across groups or over time. As 
Mutran (1989) points out, the availability of panel or group data allows a new set of 
testable hypotheses; for example, should the effect of each indicator on the underly- 
ing construct or constructs be allowed to change and, if so, is the same construct being 
studied? If the measurement model is allowed to vary or change, one may assess whether 
certain indicators of the latent construct lose or gain reliability. Note that for many 
groups, a multi-level factor analysis with appropriate random effects is more relevant 
(Section 8.4.4). 

Among the possible parameterisations for testing multi-group invariance (assuming a 
few groups only) in latent structure, means and covariances is one adopted by Byrne et al. 
(1989). They consider M = 11 indicators of adolescent self-concept in terms of p = 4 
factors: General Self-Concept (GSC), Academic Self-Concept (ASC), Language Skills 
(LSC), and Mathematics (MSC). Further, the relation between indicators and concepts 
may differ by pupil group: there are G=2 groups, namely nı = 582 ‘high track’ 
students and m = 248 ‘low track’ students. Byrne et al. assume that the measurement 
model for the mth indicator in the gth group is parameterised as 
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Ying = Km AmVijg Uimg 


where i = 1,..ng, j = Cl] is that single factor among p factors relevant to predicting the 
mth indicator, and the img are independent distributed for variable m and group g. 
Thus, only one factor is taken to explain any particular item, and the intercepts and 
item-factor loadings in the measurement model do not differ by group. The factor 
construct model for the reference group (e.g. g = 1) is 


Wi z= òy 


for the j = 1,..p factors and the students i = 1, .. ng. The 5 may be taken as independ- 
ent with means zero and variance o?, or to follow a multivariate prior with covariance 
matrix $. For the remaining groups, the factor model is 


Vije = Ag Si 


where Ag; is the intercept on the jth factor for the gth group. 


Example 8.1 Confirmatory factor analysis of adolescent self-concept and extension to a 
group factor model As in the above discussion, we follow Byrne et al. (1989) and 
Byrne and Shavelson (1986) in taking 11 indicators from the Affective Percep- 
tion Inventory (Soares and Soares, 1979) as representing four correlated ability 
concepts. The first, fifth and sixth indicators load on the first general self-concept 
factor, denoted ,. Indicators Y) and Y; load on the ASC factor ,, indicators 
Y3, Yg and Yo on the LSC factor W3, and Y4, Yio and Yı; on the remaining Mathemat- 
ics Self-Concept (MSC) 44. There are thus no multiple loadings, e.g. of Yı on both 


J, and >. 


As one device to gain identifiability, the loading \,; of Y, on wy, is set to 1, the loading 
(Az) of Y> on J, to 1, and those of Y; on W and of Y4 on W, (namely, N; and ^4) to 1 also. 
Further, the observations (for n = 996 subjects) are centred, so eliminating intercepts. 
Hence, the model is 


Yo = Ọpa Ua 
Yo=. un 
Yg = p3 ug 
Ya = Yi Uy 


Yis = Msa uis 
Yis = obi Uie 
Yn = MẸ un 
Yis =Asbhj3 Uig 
Yo = Mop uig 

Y; 10 = Aroq Ui, 10 

Yi i =Anbiq Ui, 11 

The factor scores Y, m = 1,..,4,1= 1,., n, are taken to be multivariate Normal with 


zero means and covariance matrix $, while the errors uj, j= 1,..,M,i=1,..,n, are 
taken to be independent Normal with zero means. The estimated loadings (Table 8.1) for 
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this model, obtained from a two chain 5000 iteration run (and convergence from 250 
iterations), are close to those reported by Byrne and Shavelson. 

The covariances between the factors are all positive, and may be taken to broadly 
support the idea of a unidimensional self-concept factor; the correlation between ASC 
and LSC exceeds 0.7, and that between ASC and MSC exceeds 0.6. One might therefore 
entertain a higher order factor model with a single factor underlying the four specialised 
factors. So a particular form of Equation (8.1a) obtains 


b= BN wy 


where n; denotes the single second order factor scores, assumed to be N(0, 1). One 
rationale for higher order factors is that they explain the associations between lower 
order factors, and so eliminate the need for correlated measurement errors. Here 
identifiability was improved by taking the B, to be confined to positive values, and 
taking a; = 0, since initial analysis suggested none of the intercepts to be non-zero. With 
these constraints, convergence was obtained at around 2000 iterations in a two chain 
run of 5000 iterations. 

The predictive loss criterion of Gelfand and Ghosh (1998) gives virtually identical 
results for the double order model (around 5335) as for the single order model, though 
the former has the advantage of providing extra substantive insight. The posterior 
means of the B, are, respectively, 0.43, 0.58, 0.54 and 0.42, with B4 having the highest 


Table 8.1 Self-concept: covariation between correlated 
factors and loadings relating indicators (M = 11) to factors 
(seven free loadings) 


Variances of Factors (X) and Covariances between Factors 


Èr J Zk) 
Mean 2.5% Median 97.5% 


Ži 0.74 0.66 0.74 0.83 
Zp 0.24 0.19 0.24 0.29 
213 0.21 0.16 0.21 0.25 
Di4 0.21 0.16 0.21 0.27 
Èn 0.67 0.6 0.67 0.77 
223 0.42 0.37 0.41 0.48 
oa 0.5 0.44 0.49 0.57 
33 0.5 0.51 0.58 0.66 
234 0.09 0.04 0.09 0.14 
Daa 0.87 0.78 0.87 0.96 


Indicator-factor loadings 


Ns 0.82 0.76 0.82 0.88 
N6 1.06 0.98 1.05 1.12 
A7 1.03 0.96 1.03 1.1 
Ng 1.11 1.04 1.11 1.18 
No 1 0.92 1 1.1 
N10 0.98 0.94 0.98 1.03 


Au 0.95 0.91 0.95 0.99 
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posterior variance — suggesting that the mathematics self-concept is least aligned with a 
putative single self-concept. The à coefficients in the higher order model are very similar 
to those in Table 8.1. 

Continuing the data and indicator format as above, Byrne et al. (1989) compare two 
groups (denoted high and low track) with a low track group as reference. Then the mth 
indicator for the ith subject in the gth group is modelled as 


Ying = Km AmVijg Uimg 
where the errors u are independently Normal, the reference group scores are 
Wi =ò;  j=1, 4; i= 1,..248 
and the other group’s scores are given by 
Ua =Sy Ap  j=1,4; i= 1,..582 

with the ò; taken to be multivariate Normal. In line with factorial invariance, the 
loadings are not group specific and as above loadings \j,...4 are preset and so the 
covariance matrix È for the ò; may be estimated. Hence, there are 15 intercepts (11 for the 
measurement model, and 4 for the factor model) together with seven loadings to estimate. 
The differential intercepts A; in the high track group are the major focus of interest. 

A two chain run of 10000 iterations (convergent from around 1000) accordingly 
shows higher intercepts on the Academic, Linguistic and Mathematics Self Concepts in 


the ‘High Track’ students (see Table 8.2). This model adequately represents the data 
since a posterior predictive check is 0.54, based on comparing error sum of squares of 


Table 8.2 Self-concept: group comparisons 
under invariance, high track intercepts 


Mean 2.5% 97.5% 


Ai —0.87 —2.83 1.16 


Ad 10.18 8.77 11.58 
A3 4.47 3.33 5.61 
A4 7.48 5.24 9.73 


Correlations between factors 


Corr[l, 2] 0.38 0.31 0.46 
Corr[l, y4] 0.30 0.22 0.37 
Corr[1, 3] 0.24 0.17 0.31 
Corr[p2, 3] 0.61 0.54 0.68 
Corr[p2, y4] 0.62 0.56 0.67 
Corr[p3, y4] —0.02 —0.10 0.05 


Factor loadings 


Ns 0.52 0.48 0.56 
N6 0.36 0.34 0.39 
A7 0.52 0.47 0.56 
Nsg 1.36 1.25 1.48 
No 0.67 0.60 0.74 
N10 0.68 0.65 0.70 


i 0.44 0.42 0.46 
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actual and samples of replicate data. As in the single group analysis, the ASC factor is 
highly correlated over 0.5 with both LSC and MSC, but the latter two are effectively 
independent. 

Another model allows ,;; and p; to be drawn from different multivariate Normal 
densities, and for the indicators to have different loadings ng and intercepts Kg within 
groups (Model F in Program 8.1). There is an improved predictive criterion compared 
to the previous ‘invariance’ model (about 429 000 vs. 438 000). 


Example 8.2 Mental ability indicators: robust CFA Yuan and Bentler (1998) illustrate 
the implications of non-Normality, including possible outliers that distort inferences, 
using mental ability test data analysed by Joreskog (1970) and first presented by 
Holzinger and Swineford (1939). There are in full 26 test items defined for 145 children 
in the seventh and eighth grades of two schools. Yuan and Bentler focus on 9 of the 26 
tests, namely 


. Visual perception 

. Cubes 

. Lozenges 

. Paragraph comprehension 
. Sentence comprehension 
Word meanings 

. Addition 

. Counting dots 

. Straight curved capitals 


They postulate three factors, the first spatial ability taken to explain X;—X3; the second, 
verbal ability, explaining X4—X6; and the third, speed in tasks, designed to explain 
Xı— X9. 

For these data, Yuan and Bentler use a number of robust frequentist techniques 
and densities such as M-estimation and Multivariate t density models, which place 
low weights on exceptional data points. They find an improved fit for these methods 
over Normal theory based maximum likelihood. Here a scale mixture form of 
the Student ¢ with degrees of freedom v, with precisions for case i and variable m 
defined by 


Din = P, Wim 


where the ®,, = 1 IS, is an overall precision for variable m, and the weights wim are 
drawn from Gamma (vm/2, vm/2) densities. The degrees of freedom parameters are 
drawn from an exponential prior with parameters n,,, and are constrained to be above 
1. The n„ parameters are drawn from a uniform (0.01, 1) prior corresponding approxi- 
mately to means | and 100 for vn. 

The means p,,, for indicators X, to Xo are given by 


Bim = Kin Mn, cm", C[m] 
where {C = 1, 1, 1, 2, 2, 2, 3, 3, 3} is the factor index for variable m. The scale of the 
constructs is defined by fixing \y; = Aq. = 73 = 1, so that the full covariance matrix 
between factors is estimated. The loadings under the alternative method when their 
variance is set at 1 (and the covariance matrix becomes a correlation matrix) are derived 
in parallel, namely 
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Table 8.3 Mental ability: loadings and degrees of freedom estimates 


Loadings Degrees of Freedom 

Mean 2.5% 97.5% Mean 2.5% 97.5% 
an 4.6 2.8 5.9 vi 20.4 2.4 87.9 
Oo) 2.5 1.1 4.3 v 30.6 3.0 225.4 
C3) 6.3 2.1 11.1 V3 65.4 4.1 307.7 
Cap 1.6 1:2 1.9 v4 33.4 4.4 167.5 
59 2.1 1.4 2.8 Vs 27.9 3.7 127.6 
Co 3.5 2.4 4.8 V6 18.7 2.4 111.2 
073 17.1 14.0 20.1 v7 53.7 5.3 2144 
C33 19.0 13.4 27.0 vg 2:5 1.0 6.9 
C03 24.4 16.3 36.8 vo 26.9 4.3 120.2 


Correlations between factors 


T12 0.56 0.38 0.72 
T13 0.39 0.20 0.56 
T23 0.23 0.05 0.39 
Enj = NmjSm 


A simple measure of fit is provided by the square root of the Mean Error sum of 
squares (MSE). A posterior predictive check, based on comparing observed mean 
square error and that obtained with replicated item scores (sampling from the posterior 
means under the model) shows acceptable fit. The check criterion comparing the 
new MSE with the observed one averages about 0.54. The predictions may also 
be used in criteria such as those of Gelfand and Ghosh (1998) and Ibrahim et al. 
(2001). 

The loadings and the relativities between them (in terms of ratios \mj/Nmk, Where 
j, k = 1, 3) are similar to those of Yuan and Bentler. However, the credible intervals are 
wider than those implied by the standard errors of the robust procedures in Yuan and 
Bentler (1998, Table 1). For example, the standard errors of Cg; over the five robust 
procedures vary from 1.67-1.82, as against a posterior standard deviation of around 
3.2. Asymmetry in the densities of these loadings is also apparent, and would not be 
allowed for in many frequentist procedures. The degrees of freedom specific to each 
indicator show the ‘Counting Dots’ item as the most dubious in terms of Normality. 
Correlations between the factors suggest verbal and spatial ability to be positively 
associated. 

The weights w;m show clear outliers for some variables. On ‘Counting Dots’, the 24th 
and 106th subjects seem to be extreme outliers, with weights wig below 0.005; these 
are also identified by methods applied by Yuan and Bentler. The profile of remaining 
cases on this variable is much less subject to skew; the next weight (for subject 35) is 
nearly 0.8. 


Example 8.3 Nonlinear and interactive latent variable effects The question of non- 
linear effects of latent variables or of interactions between them has been raised in 
applications of structural equation models (e.g. Bollen and Paxton, 1998). Arminger 
and Muthen (1998) and Zhu and Lee (1999) consider Bayesian approaches for possible 
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schemes involving one or more latent variables €;1,. . €, for which indicators Xj1, . . X; 
are available. If, also, Wii- -Wip have indicators Yj,.. Yiy, then one may have a 
structural model whereby the endogenous constructs are predicted by powers of, or 
interactions between, the €,. For instance, if p = 1 and q = 2, one might have 


2 2 
Wi=a yia ya 362 Yaso  YsEi E22 


Arminger and Muthen (1998, p. 285) consider simulated data for n = 100 subjects based 
on a measurement model relating five indicators to two constructs, &; and &, and a 
structural model for a single observed and centred response Y; modelled as a function of 


(a) main terms in ë; and &é, 
(b) an interaction between them, and 
(c) a random error, w;, independent of the constructs. So 


Yi = iE Yo Y3EiEs Wi 


This type of model has applications in performance testing (see Early et al., 1990), where 
€, might be task complexity and &, goal specificity, with y, and y, expected to be 
positive, but y, negative on theoretical grounds. 

The measurement model for the simulation involves loadings Amx from items 
Xm(m = 1,..5) to factors k as follows: Mı; = 1 (preset), M21 = 0.7, M31 = —0.5, Ag = 1 
(preset), Ns2 = 1.6. The coefficients in the structural model are 


{Yb Y2> Y3) = (0.8, 1.7, 0.5) 


The remaining parameters used to simulate the data are contained in Program 8.3, with 
the dispersion matrix È of the constructs ¢; including a positive correlation between the 
two factors. 

Satisfactory convergence in this and similar small sample examples may be assisted by 
good starting values (e.g. from maximum likelihood analysis) and by further prior 
information on the structural or measurement model parameters (e.g. that the inter- 
action parameter is expected to be positive or negative). 

Here N(0, 1) priors on the free loadings à and coefficients y are adopted without 
constraints to positive or negative values (with the loadings M11 = ^42 = 1 being prede- 
termined). Priors on precisions are as in Arminger and Muthen. Some guidance as to the 
labelling of the constructs (in terms of ensuring identifiability without label switching in 
MCMC sampling) is provided by the preset loadings \1; and Naz. Clear convergence, 
and incidentally convergence towards the parameters which reasonably approximate 
the generating parameters (Table 8.4), is apparent by around 5000 iterations in a two 
chain run of 25000 iterations. 

An inconsistency with the mode of generating the data is apparent if the assumed 
model is taken as 


Y= ya neo kno ëa sD 


This model (Model B in Program 8.3), estimated with the same priors as above (and 
with a 25000 iteration run), shows ‘significant’ y4 and y5, with 95% credible intervals 
(0, 0.2) and (—0.56, —0.07). Estimates of the other coefficients are similar to those of 
Model A. The predictive criterion of Gelfand and Ghosh, here relevant for predicting 
both indicator sets X and Y (via the variables Z.X and Z. Y in Program 8.3), shows an 
improved prediction for the extended model. 
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Table 8.4 Nonlinear factor effects, parameter summary 


Mean St. devn. 2.5% Median 97.5% 


xu 1.66 0.25 1.24 1.64 2.22 
12 0.27 0.13 0.03 0.26 0.54 
X29 0.27 0.13 0.03 0.26 0.54 
yı 0.66 0.09 0.49 0.66 0.84 
y2 1.73 0.21 1.37 1.72 2.18 
3 0.51 0.11 0.31 0.50 0.74 
N21 0.63 0.05 0.53 0.63 0.74 
N31 —0.42 0.06 0.54 0.42 0.32 


N52 1.36 0.16 1.08 1.36 1.68 


8.3 LATENT TRAIT AND LATENT CLASS 
ANALYSIS FOR DISCRETE OUTCOMES 


As discussed above, the Normal linear model relating M continuous outcomes Y to p 
underlying continuous factors | has the form 


Yim = Km Ami Yi Amr oa AmpWip oa Uim m = l, ..’ M 


for i= 1,..n subjects. For identifiability, one option is to assume the factors 
Ẹ; = (as -- Yp) have zero mean and variance unity. The residual error terms u are 
usually taken to be conditionally independent, since the ,; are intended to explain the 
correlations among the observed items. 

A similar framework may be postulated for observations on M discrete items (e.g. 
binary or ordinal data), which are to be explained by p metric factors. For example, 
again assuming conditional independence, suppose we have M binary items and that 
there is p = 1 latent trait. Then a typical model analogous to that above has the form 


Yim ~ BErn(Tim) i=l,..nņ, m= 1,..M 
logit(Tim) = Km Amn}; 
where the W, are again standard Normal factor scores (sometimes called ‘latent traits’ in 
psychological or educational applications), where the km represent the success rate on 
item m, and where the Nm are loadings relating the mth item to scores on the latent trait. 


A special variant of this is the Rasch model in educational testing, where Xm = 1 for all 
items, 

logit(tm) = Km Y; 
where Km represents the ‘difficulty’ or discriminatory power of the mth item, and the y; 
are interpreted as abilities or frailties of the subjects. 

A number of modelling schemes have been proposed for other types of multivariate 
discrete outcomes, including ordered multinomial outcomes and models for correlated 
count outcomes (Chib and Winkelmann, 2000). Options include non-conjugate models 
with multivariate Normal or Student ¢ error densities within a log link (Poisson) or logit 
link (binomial) framework, or models involving latent continuous observations Y* 
which parallel each observed discrete variable. However, alternatively, a multivariate 
error structure of dimension M might be replaced by a factor structure with p < M 
constructs. 
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For example, suppose the observations consist of a mixture of M-H continuous 
variables and H ordinal outcomes containing r1, r2,..rg categories, respectively. For 
simplicity, assume r =r, = r2 =..=ry, and that the observed ranks are denoted 
Yin, i= 1,..n, m= 1, H. To model correlation among these variables or introduce 
regression effects, one may define (Chapter 3) latent variables Y;*,and cut points 8,,, 
on their range, such that 


Yim =k if Sm, k-1 Š Yin < Oink 
with r — 1 cut-points, defined as 
=o < Sml <.. om, r-1 SOO 


As noted above, the YÄ, might be taken to be multivariate Normal or Student ¢ of 
dimension H. For binary outcomes, Y;* is constrained to be positive when Y; = 1, and 
sampling Y* from a Normal with variance 1 is equivalent to assuming a probit link for the 
probability that Y = 1. With a large number of ordinal or binary outcomes, observed 
together with metric outcomes, one may consider modelling the data in terms of a smaller 
number of constructs: then the combined set of variables (Yj*,.. YŠ, Yu 1,... Ym) is 
expressed in terms of the usual LISREL model in Equation (8.1), but with variance 
structures for the Y* variables defined by identifiability. 


8.3.1 Latent class models 


In some circumstances, it may be more plausible to treat the p latent variables as 
categoric rather than metric. Thus, subjects are classified into one of K classes if 
p = 1, or cross-classified into one of Kı x K classes if p = 2, rather than being located 
on a continuous scale or scales. For example, Langeheine (1994) considers a longitu- 
dinal setting where observed items relate to children at different ages, with the changing 
observations on the items taken to represent stage theories of developmental psychology 
which postulate distinct stages (i.e. discrete categories) of intellectual development. In 
other circumstances, there may be no substantive rationale for preferring a latent trait 
or latent class model, but both provide adequate fit to the observed data — so leading to 
model indeterminacy (Bartholomew and Knott, 1999, Chapter 6). 

Let the prior probabilities on the K classes of a single latent category be denoted ng, 
with Èn, = 1. The ‘independent variable’ y is now comparable to a categorical factor in 
log-linear models, with the first category providing a reference category under a “corner 
constraint’. The above example for M binary items may now be expressed 


Yim B Bern(Tim) 
Logit(ttin) = Km Amd; =2] =. mx BLb; = K] 


where d[; = k] is 1 if the ith subject is allocated to the kth category of W and 
of; = k] = 0 if |; is not allocated to the Ath category. Depending on whether K = 2 
or K > 2, the subjects are allocated to classes according to 


Y; ~ Bern(n) 


with n assigned a beta prior (say), or according to 


(8.2a) 


wb; ~ Categorical(n) 


where n is of dimension K and may be assigned a Dirichlet prior. 
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An alternative, but ultimately equivalent, modelling framework takes 
Yim ~ Bern(ty,, m) (8.2b) 


where W; has K categories of ‘caseness’ and mgm are the probabilities of items m for 
subject i according to that subject’s caseness. With priors on y, as above, beta priors 
may then be assigned to the Tm or a link function used such as 


logit(Tkm) = 98m k=1,..,K (8.2c) 


where Normal priors, perhaps constrained to produce consistent labels during MCMC 
sampling, are adopted for the km. 

Latent categories may be useful in medical diagnosis where the observed indicators Y 
are various tests or criteria of illness, with none being certain or ‘gold standard’ 
indicators of the presence of a disease. Thus, following Rindskopf and Rindskopf 
(1986), one may assess whether a single latent categorisation (e.g. if K = 2, the latent 
categories might be ill vs. not ill) underlies several observed binary diagnostic items, 
none of which provide a ‘gold standard’ test. This type of model is illustrated with 
psychiatric caseness data from Dunn (1999). 


Example 8.4 Psychiatric caseness Dunn (1999) presents data on three binary diag- 
nostic items applied to n = 103 patients. The binary responses are taken from dichot- 
omising more extensive scales, and are denoted CIS (Clinical Interview Schedule), GHQ 
(General Health Questionnaire) and HADS (Hospital Anxiety and Depression Scale) 
(see Dunn, 1999, p. 7). Let Gaj denote the totals of patients in category h of CIS, i of 
GHQ, and j of HADS, where h, i and j = 1 for less ill, and A, i and j = 2 for more ill. 


Then as in Dunn (1999), one may apply a log-linear model incorporating the classifi- 
cation LC representing the patients latent diagnostic status (assumed to have K = 2 
states, with index k = 1, 2). In terms of the notation of Section 8.3.1, LC corresponds to 
the latent caseness category i,,;. As in a more general latent mixture regression (Chapter 3), 
there are questions around label switching in MCMC sampling, and so of adopting 
priors which ensure consistent labelling. The likelihood is multinomial 


Gry ~ Mult(a;, n) 
with 
Thij = Xk Whijk / Žhijk Whijk 


and log (mp;jx) modelled in terms of (a) the main effects in the items and the latent 
variable, and (b) the interaction effects between the items and the latent variable. 
Specifically, 


log Prine = C Bia Boi Bs; Bak Oak @2,ik 3, j,k 


As usual in a log-linear model, there are corner constraints on the parameters (e.g. 
Bu = Bor = B31 = Ba = 9). 

An alternative analysis involves disaggregating the observed data to individual level, 
and adopting Bernoulli sampling, especially if there are continuous covariates which 
assist in predicting the latent diagnosis. 

In adopting the aggregate approach, a prior N(0, 10) confined to positive values is 
adopted for the free parameter B4, 2 so that category two of the latent class variable 
identifies more ill patients (the ‘cases’). A two chain run of 10000 iterations (with 
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convergence from around 4000 iterations) then produce similar results to those cited by 
Dunn (1999, p. 42) for the main effects of CIS, GHQ and HADS items, and of the latent 
diagnosis LC, and for the interactions CIS x LC, GHQ x LC and HADS x LC. The 
posterior medians of logged odds ratios describing the interaction between LC and the 
three items actually used are given in Table 8.5 (i.e. the interaction parameters CIS.LC, 
GHQ.LC and HAD.LC). These are 3.67, 4.08 and 3.86, respectively, and indicate a 
consistency between the three items. 


Table 8.5 Psychiatric caseness: items (CIS, HAD, GHQ) and Latent Class 
(LC) 


Mean St. devn. 2.5% Median 97.5% 


Aggregated (log-linear) model 


CIS.LC 3:73 0.84 2.32 3.67 5.70 
GHQ.LC 4.23 1.05 2.64 4.08 6.61 
HAD.LC 3.99 0.97 2.45 3.86 6.34 
CIS —2.08 0.73 —3.88 —1.98  —0.97 
HAD —2.32 0.83 —4.31 —2.19 —1.10 
GHQ —2.86 0.98 —5.04 —2.69 —1.45 
LC —4.71 0.85 —6.62 —4.63 —3.26 


Disaggregated model Class Probabilities (Prob LC = k) 


nı 0.42 0.05 0.31 0.42 0.49 
n2 0.58 0.05 0.51 0.58 0.69 


Marginal Tables 


(1) CIS by LC 


Both No 39 3 34 39 44 
CIS Yes, LC No 5 2 0 5 10 
LC Yes, CIS No 10 3 5 10 15 
LC Yes, CIS Yes 49 2 44 49 54 
(2) GHQ by LC 

Both No 42 3 35 42 48 
GHQ Yes, LC No 2 2 0 2 6 
LC Yes, GHQ No 12 3 6 12 19 
LC Yes, GHQ Yes 47 2 43 47 49 
(3) HAD by LC 

Both No 40 3 34 40 45 
HAD Yes, LC No 4 2 0 4 8 
LC Yes, HAD No 10 3 5 10 16 
LC Yes, HAD Yes 49 2 45 49 53 


Probabilities m&m by LC (k = 1, 2) and Item (m = 1, 3) 


Ti 0.11 0.06 0.00 0.10 0.25 
712 0.05 0.05 0.00 0.04 0.17 
713 0.08 0.06 0.00 0.08 0.22 
T21 0.83 0.06 0.71 0.84 0.94 
T2 0.79 0.07 0.65 0.79 0.92 


T23 0.83 0.06 0.71 0.84 0.94 
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Marginal tables cross-tabulating patients by their caseness on each item and caseness 
on LC can also be obtained. If the categorisation by LC is regarded as the ‘true’ 
diagnosis, then the sensitivity of each item can be obtained with regard to detecting 
the true diagnosis. The overall probability of being a case (n2) is estimated as 0.56. 

A suggested exercise is to include second order interactions with LC (via parameters 
Vinik> Yonjk> Y3ijk) and assess fit via a predictive criterion — for instance, using samples of 
new cross-classified data Grey, nj. Note that this model yields a higher probability of 
being a case. 

The second analysis uses disaggregated item data with Bernoulli sampling, and the 
model in Equations (8.2b)-(8.2c). The prior constraint n, > n; is set for identifiability. 
One possible set of outputs from this analysis is the marginal cross-tabulations of item 
by LC classification. For instance, in the LC by GHQ table, the LC variable classifies 59 
patients as cases, whereas there are 12 patients classified as well by the GHQ but who 
are cases according to LC. On this basis, the sensitivity of the GHQ (with LC regarded 
as the ‘true’ diagnosis) is obtained as 47/59 = 80%. 

The sensitivities are equivalently the probabilities 72,, of being classed as more ill on 
the three indicators given |, = 2. They show that the CIS scale has a high sensitivity 
(83%), but also a relatively high chance of classifying someone as ill when they are well 
according to LC (i.e. a false positive). 

Finally, a latent trait analysis with a single continuous factor is carried out, namely 


Yim ~ Bern(tim) i=1,..mm=1,..M 
logit(tim) =Km Amb; 


with all Nm as free parameters and y, ~ N(0, 1). N(0, 10) and N(1, 10) priors are 
assigned to Km and Nm, respectively. Sampling new data Zim ~ Bern(aj,,) and accumu- 
lating the subjects according to whether Zim = 0 or 1 (m = 1, 2, 3) forms a predicted 
cross-classification Gi, new. These can be compared with the actual counts by a predict- 
ive fit criterion (Gelfand and Ghosh, 1998). Another possibility with the latent trait 
model is the opportunity to choose K > 2 cut points on the latent trait W,; for instance, 
taking K = 3 might correspond to the divisions: well, some symptoms, and definitely ill. 
Here K = 2 is chosen with the cut point at zero, and H[1:103] is the resulting classifier in 
Program 8.4. 

This model appears to have an edge on the latent class model, with the predictive 
criterion (when w = 1) being around 91 compared to 111 under the LCA model; its 
advantage is in providing more precise predictions. From a 10000 iteration run in two 
chains, the posterior means and 95% credible intervals of A = (Ay, M2, M3) are 3.3 
(1.7, 6.5), 3.9 (1.9, 7.5), and 3.7 (1.8, 7.2). The selected cut point on the latent trait 
leads to similar marginal tables as those in Table 8.5, though slightly fewer patients are 
classed as cases (namely those with positive ,). 


Example 8.5 AIDS risks To illustrate the application of a latent trait structure for 
modelling associations between a mix of continuous and discrete outcomes, data from 
Shi and Lee (2000) on sexual practices and attitudes of female sex workers in the 
Phillipines is used (see also Song and Lee, 2001). Three indicators (Yj, Y2, Y3) are 
continuous and three (Y4, Ys, Ys) ordinal. The last three relate to AIDS risk from 
intercourse with an AIDS infectee, a stranger or a drug user, and are positive measures 
of ‘caution’ or ‘worry’ about contracting AIDS. They all have r = 5 ranked categories, 
from 1 = no risk to 5 = high risk. By contrast, the first three measures are positive 
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Table 8.6 AIDS risk: bivariate factor model 


Mean St. devn. 2.5% 97.5% 


Cut points 

811 —1.57 0.09 —1.76 —1.40 
8p —0.71 0.08 —0.86 —0.56 
813 0.42 0.07 0.27 0.57 
814 0.95 0.08 0.79 1.11 
851 —4,33 0.32 —5.02 —3.77 
809 —3.24 0.24 —3.78 —2.84 
$53 —1.39 0.13 —1.68 —1.17 
804 —1.04 0.11 —1.28 —0.85 
83] —3.30 0.29 —3.99 —2.85 
832 —2.22 0.21 —2.73 —1.89 
$33 —0.20 0.09 —0.37 —0.02 
834 0.25 0.09 0.08 0.44 
Loadings 

Nu 0.31 0.06 0.19 0.43 
A21 0.66 0.13 0.46 0.96 
N31 0.39 0.08 0.25 0.55 
Nar 0.56 0.25 0.22 1.12 
N52 1.16 0.43 0.49 2.17 
N62 1.36 0.49 0.62 2.42 


Factor correlation 


w —0.15 0.09 —0.38 —0.01 


measures of ‘recklessness’: relating to frequencies of actual intercourse, of ‘hand jobs’ 
and of ‘blow jobs’. 


Following Shi and Lee, two latent variables ,, i = 1,..n, j = 1, 2 may be proposed 
with the 6x2 matrix A of loadings of (Y,,...., Y6) on the factors taking the following 
structure: 


N= Ait A21 AB] 0 0 0 
0 0 O ày s g2 
where Aji, m = 1,..3 are free loadings relating Y;— Y3 to p; and Xm, m = 4, . . 6 relate 
Y4— Y6 to Wp. Thus, if the loadings are consistently positive, the factors are interpretable 
as overall levels of worry about AIDS and recklessness, respectively. The are 


bivariate Normal with zero means, and their covariation described by (and only 
identifiable as far as) a correlation matrix, 


ay 


Note that one might consider a single factor model as an alternative, since conceptually 
the factors seem to overlap. 
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The three continuous indicators Y;m are taken to be Normal and, despite being 
standardised (as they are by Shi and Lee), taken to have means 


Km Ami 


with variances ¢,,. Observed values on the three ordinal indicators Y;,, are determined 
by underlying latent variables Yj,, with r—1 cut points 8,.x.(m=4, 5, 6; 
k= 1,..,r— 1). Thus, 


Yim ~ Categorical (Pi, m, 1v) m= 4, 5, 6 


where 
Pi, m,1 = Qi, m,1 
Pi, m, k = Qi, m, k — Qi, m, k-1 k=2,..,r—1 
Pi, m,r = l — Qi, m, r-1 

and 


logit [Q;, m, k] = 8ink — Bim 


The subject means p;„ for the ordinal variables are given by 


Hig = Again 
bis = Asoo 
Lis = Aerio 


Shi and Lee obtain a negative correlation w of —0.17 between ,; and p, and positive 
loadings on all of \,,; and Am2. Here a constraint of positivity on the loadings is applied 
for identifiability. Another option might be an exchangeable prior Nme ~ N(,, 0%), 
where w, is constrained to be positive. 

A two-chain run of 2500 iterations (with convergence after 1250) reproduces the 
structure obtained by Shi and Lee (see Table 8.6), namely positive loadings on the 
first factor (worry about aids) and on the second indicative of recklessness, and accord- 
ingly a negative correlation of around —0.15 between the factors. The density of œ 
shows some negative skewness, so the ratio of the posterior mean to posterior standard 
deviation may mislead in terms of ‘significance’. 


8.4 LATENT VARIABLES IN PANEL AND CLUSTERED DATA ANALYSIS 


The structural equations approach, whether with latent traits or classes, may be applied 
to repeated data over time. For instance, Muthen (1997) stresses the role of latent 
variable methods in panel and multi-level models with random coefficients and variance 
components, as against the conventional psychometric application in terms of con- 
structs measured with error. Molenaar (1999) cites some of the benefits of longitudinal 
data as a way to understanding causal mechanisms, and also offering ways of control- 
ling the processes under scrutiny, depending on interventions made or not made. The 
elements of procedures to establish causality are illustrated by latent or observed 
variables ,, and p at times ¢ with coefficients B; relating Y; to and , ,_,. Then, if 
w, is a cause of |, but not vice versa, one expects B,;, = 0 and B»; 4 0; such procedures 
are formalised as Granger causality (Granger, 1969). A variety of this type of inference, 
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involving causal inter-relations of latent constructs over time, is illustrated by the well 
known alienation study of Wheaton et al. (1977), considered in Example 8.11. 


8.4.1 Latent trait models for continuous data 


Corresponding to the canonical cross-sectional LISREL type specification in Equation 
(8.1) suppose the data consist of time varying outputs possibly multivariate { Yin;}, and 
time varying inputs {Xi}, which are respectively indicators for time varying endogen- 
ous construct W,, and a time varying exogenous construct &,. Then the measurement 
model (for t = 1,.., T) has the form 


Yim =, Mabe Uy m=1,..M (8.3a) 


Xit = K} hay Sir Uji j=1,..L (8.3b) 


jt 


and the associated structural model could be simply 


We =O YEr Wit (8.30) 


As in Chapter 6, one may also propose permanent subject effects. Accordingly, Long- 
ford and Muthen (1992) suggest a measurement model with both constant latent effects 
Y; and time varying latent effects y>; for multiple observed items Ymr: 


Yim =Km Àm Wi; Am; Uimt (8.3d) 


though this model raises identifiability issues. For example, if no constraints are placed 
on the M pairs of loadings {jn1, Nm2}, then both construct variances must be fixed. If 
the measurement model included only stable traits, then one might consider changing 
loadings Att. 

As a generalisation of Equation (8.3c), various autoregressive structural models 
are possible (Molenaar, 1999; Hershberger et al., 1996), for example, an AR(1) lag in 
both y and £ 


Wir =O Bpr- YEr Wi 


Molenaar (1999) outlines state-space models for a multivariate latent vector 
Wir = Wis Van- --Vip) Of dimension p<M, and the joint evolution of 
{ Yim, m = 1, M} and (hy, j= 1, .. p} is described by the matrix model 


Pr = Bpa- Wit (8.4a) 


Vit = Arh, Uit (8.4b) 
where B, is p x p and A; is M x p. 


8.4.2 Latent class models through time 


The latent class models of Section 8.3.1 may similarly be extended to models where 
repetition is over times instead of items, or possibly over both times and items 
(see Hagenaars (1994) and Langeheine (1994)). For the latent class model, the addition 
of a time dimension means changes in underlying state may occur, and in this context a 
latent class model may have the edge over a latent trait model in terms of substantive 
interpretability. Consider repeated observations on a binary item for subjects i: 
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Yu ~ Bern(77;;) 
or equivalently, defining the observations as (1, 2) instead of (0, 1): 
Yu ~ Categorical(p; +, 1:2) 


where pı and p;ņn are the probabilities of Yy = 1, and Yy = 2, respectively, with 
Pii = Ti. More generally, an observed variable with R categories will have likelihood 


Yu ~ Categorical(p;, ;, 1r) 


A latent class model for this type of observational series is often specified by assuming 
Markov dependence between the observed and/or latent states at times ¢ and t— 1. 
Under a first order Markov model, the state occupied at time ¢ depends only upon 
the previous state, and not on any earlier ones. Then one possible model involves a 
Markov chain defined by the observed categories, but mixes transition behaviour over 
a latent state y; with K categories (van de Pol and Langeheine, 1990). It is necessary to 
consider the first period observation Y; separately from others, as it has no (observed) 
antecedent, and so initial allocation probabilities 6,(k = 1,..K) may be assumed, 
which are also differentiated by the latent state s. So for subjects i = 1,..n, one might 
specify 


wb; ~ Categorical(n,.x) (8.5a) 
Yi ~ Categorical(8y,, 1:r) (8.5b) 
Yu ~ Categorical(p,,, Yo ER) fd (8.5c) 


Another option models the Y; as independent of previous observed category Yj, ;-1, but 
involves transitions on a latent Markov chain, defined by a variable y, with K states. So 
the multinomial probabilities n are now specific for ;, ,_,, with 


pa~ Categorical(ny,, Sete) >l (8.6a) 
and the observations have multinomial probabilities defined by the selected category 
of ;, 
Yi ~ Categorical(py,, o ER) (8.6b) 
The first period latent state is modelled as 
Y ~ Categorical(è:x) (8.6c) 


A higher level of generality, analogous to Equation (8.3d) for metric data, would 
be provided by a model with mixing over a constant latent variable with K categories, 
and a latent transition variable ¢;, with L categories defined by the mixing variable. 
So 

wb; ~ Categorical(ny.x) 

Ci ~ categorical(ky,, 1.1) 

Ci, ~ categorical(ly,, i-i, 1) > 1 

Yi, ~ categorical(p,,, Cir, 1R) 
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8.4.3 Latent trait models for time varying discrete outcomes 


The time varying factor generalisation to multiple discrete responses Yj; and predictors 
Xj may, especially for binomial or ordinal data, involve latent continuous underlying 
variables Yi and Xii The model in these latent variables then resembles measurement 
and structural models adopted for continuous data, as in Equations (8.3a)-(8.3c). 

Thus, Palta and Lin (1999) propose a model for observations on M longitudinal 
binary and ordinal items which allows for measurement error by introducing a single 
latent construct (or possibly p latent constructs, where p is less than M). Their empirical 
example consider the case where all outcomes y are binary, specifically M = 2 binary 
items relating to tiredness. Then, as usual for binary outcomes, suppose there is an 
underlying latent scale Y* for each item: 


Yne=1lif Ya, > 


8.7 
=0if YA < a 


Yo=1lif YŠ >t 


8.7b 
=0if YA <m oe 


The latent scale is in turn related to an underlying time-varying continuous construct &; 
in a structural model: 

Yi: = Bigs Wi (8.70) 

Yi; = Bači Wiz (8.7d) 
where B; = 1 for identifiability. The structural model then relates the ë; to observed 
covariates Xj; (taken as free of measurement error): 

Gi =K MXi NX... Cit (8.7e) 

In the Palta—Lin model, the ex are taken to be autocorrelated, with 


are (1) 
eit = Piei, t-1 Uit 


(8.7f) 
ul) ~ NOO, of) 


while the errors wy;(k = 1,..K) are also autocorrelated, but with the same variance for 
all k: 


hte oe : (2) 
Wikt = P2Wik, t-1  Uikt 


u? ~ N(0, 63) 


ikt 
Because, for binary data, the scale of the latent Yj, is arbitrary, a fixed scale assumption 


such as var(w)=1 is typically assumed. Here the constraint oj 035 = 1 is adopted to fix 
the scale. So that k can be identified, the thresholds t; and t2 must also be set to zero. 


8.4.4 Latent trait models for clustered metric data 


Panel data is a particular type of clustered design, and the principle of multivariate data 
reduction applies to other types of data which are hierarchically structured. Thus, 
consider cross-sectional data with level 1 units (e.g. pupils) clustered by higher level 
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units (e.g. schools at level 2), with latent variables operating at each level. Assume a two- 
level model with pı factors at level 1 and p factors » at level 2 (clusters), and a 
continuous outcomes Yj, for clusters j = 1,..J, individuals i= 1,...n,; within clusters, 
and variables m = 1,.. M. 

Assume further for illustration that p = pı = p2 = 2. Then one might take 


Yim = Km Amy Am2, Ym Vi Yn22ii U2jm — Ulijm (8.8) 


where the level 1 errors tiim are Normal with variances o and the level 2 error is MVN 
of order 2 with dispersion matrix $. The priors adopted for the variances/dispersions of 
the factor scores {1;, 2), Vij, Yo} depend in part upon the assumptions made on 
relationships between the loadings at different levels. 

Thus, the dispersion matrices ©; and ®, of the constructs = (Yi; Wz) and 
P = (Pı; Pz), respectively, are assumed to be identity matrices if the M x 2 loadings 
A= {\m, Nm} at level 2 are estimated independently of the M x2 load- 
ings T = {p15 Ym2} at level 1. For pı = pz = 2, factors’ at each level, there also needs 
to be one constraint on the level 2 loadings (e.g. setting 41; = 1) and one on the level 
1 loadings (e.g. setting y,,; = 1) for identifiability. Setting structural relationships be- 
tween the loadings at different levels (or setting extra loadings to fixed values) makes 
certain dispersion parameters estimable. For example, one might take A =T. 
Depending on the problem, further constraints may be needed to ensure identification 
under repeated sampling in a fully Bayesian model. 


8.4.5 Latent trait models for mixed outcomes 


Analogous models including mixtures of discrete and continuous outcome variables 
have been proposed (e.g. Dunson, 2000). Thus, consider a set of observations Yj, on 
variables m = 1,.. M, for clusters j = 1,..J and sub-units i= 1,..n; within clusters. 
Linked to the observations are latent variables Yin drawn from densities in the expo- 
nential family (e.g. normal, Poisson, gamma), with means 0jj, = E( Yim) predicted by 


A(®im) = BXim P; Vim Yy Wi 


where / is a link function. Xj, is an M x 1 covariate vector with impact summarised by 
a population level regression parameter B, and the 9; and y; are, respectively, vectors of 
cluster latent variables and latent effects specific to cluster and sub-unit. Vim and Wim 
are vectors of covariates, and may be subsets of the Xj), but often are just constants, 
with Vim = Wim = 1. The latent variables Yj, and Y} (m Æ I) are usually assumed 
independent conditionally on ọ; and ,. Frequently, the Yin are Normal with identity 
link and hence expectation 


Bim =BXiim Vim Uy Wij 


and diagonal covariance matrix of dimension M x M. 

In the case Vim = Wijm = 1, an alternative formulation for 0; takes the g; and pj as 
having known variances (e.g. unity), and introduces factor loadings Nm and y, specific 
to variable m. For example, with a single factor at cluster and cluster-subject level 


' For pı factors at level 1 and p at level 2 and with A estimated independently of I’, there are pı(pı — 1)/2 
constraints needed on A and p2(p2 — 1)/2 on I. Informative priors may be an alternative to deterministic 
constraints. 
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Bim = BXim AP; Yi 


As an example where all the observations Yj, are all binary, the Yin could be taken as 
latent Normal variables, such that 


¥e,= 1) af Yin > 0 


For identifiability the variances of Yj, are taken as unity, so that the probability of an 
event, i.€. Tijm = Pr( Yim = 1), is 


D(BX im P; Vij Yy Wiim) 


where ® is the distribution function of a standard Normal variable. The Poisson is an 
alternative latent density in this example, with 


Yim =1 if Yin > hm 


Yim =0 YE, < hm (8.22) 
where An is a threshold count (e.g. unity), and where 
Yin ~ Poi(8jm) (8.9b) 
and 
log (jim) =BXim Vim by Wii (8.9c) 


The ‘hits’ variable A} may be a free parameter, and may differ between variables m. 
If hy =1 then Model 9 is equivalent to complementary log-log link for 
Pr( Yim = 1) = Tijm, namely 


log ( Ei log d- Tijm)) = BXim P; Vi Wi Wi 
Another possibility is a ‘no hits’ mechanism in (8.9a) defined by 


If the Yim consisted of Mı binary variables and M-M, continuous variables, then 
(cf. Muthen, 1984) one sets observed and latent variables identically equal 


Le m= Mı 1,..M 


while one of the latent variable options above for the binary outcomes m = 1,.. Mj, is 
used, such as 


Yim =1 if Yin >0 


A diagonal dispersion matrix V = cov(Y*) will then have M-M, free variance param- 
eters. Extensions to the case where the set of the Y; includes polytomous outcomes, 
with categories ordered or otherwise, can be made. 


Example 8.6 Changes in depression state The first two of the models in Section 8.4.2 
for discrete longitudinal series, as in Equations (8.5a)-(8.5c) and (8.6a)-(8.6c), were 
applied to data on 752 subjects for T = 4 periods (Morgan et al., 1983). The data were 
binary (R = 2), with 0 = ‘not depressed’ and 1 = ‘depressed’, coded in the categorical 
form (1, 2) in Program 8.6. Following Langeheine and van de Pol (1990), two latent 
states are assumed on the underlying mixture or latent transition variables. 
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The mixed Markov model with mixing only over a discrete latent variable ,, but no 
latent transitions is applied first, as in Equation (8.5) (Model A). To gain identifiability, 
the following constraints are made: 


N2 > 11 
dy > 811 
821 > 822 


where 8y,, ;, j = 1, R defines the multinomial likelihood for the initial observations Y; . 
To set these constraints gamma priors are used, and then the property that Dirichlet 
variables can be obtained? as ratios to the sum of the gamma variables. These 
constraints are based on the maximum likelihood solution reported by Langeheine 
and van de Pol (1990, Table 4, Model D). 

A two chain? run of 5000 iterations shows early convergence under the above 
constraints. The posterior parameter estimates suggest a small group (n; = 0.15) with 
an initially high chance of being depressed (812 = 0.65), but around 40% chances of 
becoming non-depressed (p,, = 0.40), as in Table 8.7. The larger latent group 
(n = 0.85) has a high initial probability of being non-depressed and high rate of staying 
so over time (p21; = 0.94). 

To ensure identifiability of Model B, namely the latent Markov chain model in 
Equations (8.6a)-(8.6c), an alternative strategy to constraining parameters is adopted; 
specifically, it is assumed that one individual with the pattern ‘no depressed’ at all four 
periods is in latent class 1, and one individual with the response ‘depressed’ at all periods 
is in state 2. With this form of (data based) prior there is early convergence in a two 
chain run of 5000 iterations. 

The substantive pattern identified under this model (Table 8.8) is in a sense more clear 
cut than Model A, since it identifies a predominantly non-depressive latent class 
(defined by W,) with high initial probability 6; = 0.80, which has virtually no chance 


Table 8.7 Depression state, Model A parameters 


Mean St. devn. 2.50% 97.50% 


õi 0.346 0.079 0.193 0.495 
81 0.654 0.079 0.505 0.807 
801 0.907 0.020 0.871 0.948 
820 0.093 0.020 0.052 0.129 
nı 0.150 0.041 0.086 0.247 
Nb 0.850 0.041 0.754 0.914 
pii 0.419 0.135 0.139 0.661 
pie 0.581 0.135 0.339 0.861 
P121 0.398 0.053 0.292 0.502 
fiz 0.603 0.053 0.498 0.708 
P211 0.938 0.009 0.920 0.956 
pais 0.062 0.009 0.044 0.080 
P221 0.883 0.059 0.769 0.990 
Poss 0.117 0.059 0.010 0.231 


? Tfx, ~ G(w, land x ~ G(w, 1), yı = x1 /ZXj, y2 = X2/Dx;, then {y1, y2} are Dirichlet with weight vector (w, w). 
3 Null initial values in one chain, and the other based on Langeheine and van de Pol. 
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Table 8.8 Depression state, Model B 


Mean St. devn. 2.5% 97.5% 
òi 0.804 0.033 0.733 0.868 
ò 0.196 0.033 0.138 0.261 
Ny 0.990 0.009 0.967 1.006 
Ny 0.010 0.009 0.000 0.027 
N21 0.165 0.048 0.075 0.259 
No 0.835 0.048 0.737 0.929 
Pil 0.945 0.010 0.927 0.965 
Pir 0.055 0.010 0.034 0.074 
P21 0.359 0.048 0.264 0.453 
Po 0.641 0.048 0.549 0.735 


(nız = 0.01) of moving to the other latent state defined by p. The conditional prob- 
ability of being non-depressed, given y; = 1 is 0.945. The other latent transition variable 
is more mixed in substantive terms, and includes a small group of non-depressed who 
are not certain to stay so. 


Example 8.7 Ante-natal knowledge This example applies the model of Palt and Lin 
(Section 8.4.3) to data from an ante-natal study reported by Hand and Crowder (1996, 
p. 205). The observations consist of four originally continuous knowledge scales ob- 
served for 21 women before and after a course. There are nine treatment subjects who 
received the course, and 12 control subjects. The four variates are scales with levels 0-5, 
0-20, 0-30 and 0-5. 


For illustrative purposes, the original data are dichotomised with Yı = 1 if the first 
scale is 5, 0 otherwise, Y) = 1 if the second scale exceeds 15, Y3 = 1 if the third scale 
exceeds 22, and Y4= 1 if the fourth scale exceeds 4. Hence, the model (for 
k=1,4;t=1, 2)is 


Yur =1 if YE >0 
Ye =O if Yd, <0 
Vike = Bkn Wiki 

Ey = M MXi en 


Cit = P1 Gi, t-1 uP 
y er tes Q) 
ikt = P2Wi, k, t-1  Uikt 
with of = var(u1), 05 = var(u2). The only covariate is the fixed treatment variable x (i.e. 
the course on knowledge) with coefficient ħ2. N(0, 10) priors are adopted for 
B = (B2, B3, B4) and X= (Ay, M2), and a G(1, 1) prior for 1/03 -o7 is then obtained 
via the constraint of 03 = 1. 

A two chain run of 10000 iterations (with convergence from 1500) shows no evidence 
of a treatment effect on the underlying knowledge scale &; over patients i and periods t 
(Table 8.9). The scores on this scale for individual women show improvements among 
the control group (e.g. compare & with &, for subjects 4 and 6), as well as the course 
group. There is a high intra-cluster correlation p, governing measurement errors wi; On 
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Table 8.9 Ante-natal knowledge posterior parameter summaries 


Mean St. devn. 2.5% 97.5% 
E11 0.59 0.49 —0.26 1.66 
E12 0.58 0.47 —0.24 1.59 
61 0.13 0.39 —0.66 0.94 
& > 0.77 0.47 —0.03 1.80 
&3, 1 —0.70 0.49 —1.82 0.14 
3.9 0.47 0.49 —0.34 1.58 
E41 —0.77 0.51 —1.87 0.04 
E42 0.88 0.57 —0.02 2.24 
E51 —0.24 0.40 —1.11 0.46 
E52 0.96 0.55 0.07 2.26 
&6, 1 —0.75 0.47 —1.83 0.07 
&,2 0.82 0.50 0.00 1.95 
E71 —0.39 0.42 —1.32 0.34 
E7, 2 1.10 0.57 0.16 2.34 
Es —0.14 0.39 —0.94 0.61 
Es 2 0.44 0.45 —0.35 1.42 
£1 —0.67 0.47 —1.69 0.12 
Éo, 2 1.05 0.64 0.13 2.67 
E10, 1 —0.24 0.39 —1.07 0.49 
E10, 2 0.42 0.45 —0.34 1.45 
E1101 —0.73 0.47 —1.78 0.10 
E11, 2 0.50 0.51 —0.35 1.63 
E101 0.06 0.37 —0.67 0.82 
Ei, 2 0.93 0.55 0.07 2.20 
E131 —0.63 0.49 —1.69 0.24 
E13, 2 —0.65 0.53 —1.86 0.27 
E141 —0.87 0.54 —2.15 0.00 
E14, 2 0.40 0.44 —0.39 1.34 
E15, 1 —0.88 0.55 —2.20 —0.03 
E15, 2 0.36 0.42 —0.40 1.29 
£16, 1 —0.16 0.42 —1.01 0.63 
E16, 2 —0.17 0.41 —1.02 0.61 
E17, 1 —0.03 0.42 —0.93 0.74 
E17, 2 0.93 0.55 —0.02 2.12 
Eig. —0.80 0.52 —2.03 0.06 
Eig, 2 0.03 0.40 —0.75 0.85 
E19, 1 0.57 0.50 —0.31 1.65 
E19, 2 0.58 0.53 —0.34 1.76 
£50, 1 —0.39 0.42 —1.31 0.37 
£0, 2 0.14 0.39 —0.59 0.99 
101 —0.06 0.40 —0.90 0.70 
G1, 2 0.74 0.51 —0.11 1.93 
N 0.21 0.22 —0.21 0.67 
A2 —0.05 0.24 —0.52 0.41 
B2 2.40 1.09 0.80 4.97 
B3 2.93 1.31 0.97 5.93 
B4 4.04 1.48 1.44 7.34 
Pi —0.20 0.31 —0.75 0.42 
Po 0.89 0.10 0.64 0.98 
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the same item at different time points, but the autocorrelation in the latent construct is 
lower (in fact, biased to negative values). 


Example 8.8 Factor structures at two levels This example replicates the analysis by 
Longford and Muthen (1992), in which metric data Yj, are generated for i= 1,..10 
subjects within j = 1,..20 clusters for m = 1,..5 variables, as in Equation (8.8), namely 


Yim = Km Ami P1j Am2 Pry Vi Vii Vn U2jim  Ulijm 


This example illustrates how identification of the assumed parameters from the data 
thus generated requires constrained priors on the loadings to ensure consistent labelling 
of the constructs during sampling. In the Longford and Muthen simulation, the means 
K = {K], . . K5) are zero, and the level 1 variances Var(u jj) are 1. The level 2 variances 
Var(u2jm) are 0.2. Also in the simulation, the loadings at level 1 and 2 are taken to be the 


same, i.e. 
T 
is, 1 1 1 1 
Kata? AG a 


The dispersion matrices of the factor scores {,, Y2} at level 1 and {@, >} at level 2 


are, respectively, 
1 0 
m=(9 4) 


1.25 1 
»=( 1 PA 


In estimation of an appropriate model from the data thus generated (i.e. coming to the 
data without knowing how it was generated), one might adopt several alternative prior 
model forms. The assumption A = Į might in fact be taken on pragmatic grounds to 
improve identifiability of the level 2 loading matrix. Here for illustration this is not 
assumed. 

Then with A and [ independent, minimal identifiability requires one of the loadings 
at each level must take a preset value. Here it is assumed that yı; = àq; = 1. The level 
1 and 2 factor variances are taken as preset at one, and with no correlation between 9); 
and z. Given the small cluster sizes, the observation variances, and level 2 loadings 
may all not be reproduced that closely. Identifiability was further ensured by assuming 
the first factor at each level is ‘unipolar’ (has consistently positive loadings in relation to 
the indicators Y), and by defining the second factor at each level as bipolar, for instance 
constraining y; to be positive and y», to be negative. In any particular confirmatory 
factor analysis, such assumptions would require a substantive basis. 

On this basis a two chain run of 5000 iterations shows convergence at under 1500 
iterations, and shows estimated level 1 loadings reasonably close to the theoretical 
values (Table 8.10). The observational variances are also reasonably closely estimated. 
The level 2 loadings also broadly reproduce the features of the theoretical values. 

In the absence of knowledge of the mode of data generation, one might alternatively 
(a) adopt a single level 2 factor, while still retaining a bivariate factor structure at level 1, 
or (b) retain a bivariate level 2 factor but take A=T. 


and 
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Table 8.10 Two level factor structure, parameter 
summary 


Mean St. devn. 2.5% 97.5% 


Level 1 loadings 


yi2 1.19 0.29 0.55 1.62 
Yo 1.09 0.18 0.70 1.40 
Yoo —0.92 0.21 —1.38 0.59 
Y1 0.82 0.10 0.61 1.02 
Yo —0.10 0.16 —0.44 0.16 
Ya1 0.85 0.19 0.42 1.18 
Yao —0.87 0.17 -1.22  —0.57 
Ys1 0.89 0.25 0.54 1.46 
Ys2 1.08 0.16 0.74 1.39 


Level 2 loadings 


N12 1.13 0.35 0.42 1.84 
N21 0.60 0.27 0.11 1.21 
A22 —0.46 0.27 —1.10 —0.04 
N31 0.47 0.22 0.08 0.91 
N32 0.22 0.27 —0.31 0.71 
Nal 0.36 0.21 0.03 0.82 
M42 —0.38 0.23 —0.86 0.02 
N51 1.09 0.27 0.52 1.58 
N52 0.89 0.39 0.08 1.61 


Level 1 variances 


Var(uy1) 0.71 0.49 0.06 1.73 
Var(uy) 0.43 0.24 0.06 0.93 
Var(uy3) 0.94 0.12 0.72 1.20 
Var(u14) 1.45 0.24 0.99 1.93 
Var(uys) 1.18 0.48 0.10 1.87 


Level 2 variances 


Var(uz1) 0.13 0.11 0.03 0.43 
Var(uz2) 0.15 0.12 0.03 0.47 
Var(u3) 0.26 0.15 0.05 0.62 
Var(uz) 0.10 0.07 0.02 0.27 


Var(u25) 0.22 0.17 0.03 0.65 


Example 8.9 Toxicity in mice This example uses simulated data on reproductive 
toxicity in mice, drawing on the work of Dunson (2000) concerning the toxicological 
impacts of the solvent ethylene glycol monomethyl ether (EGMME). Dunson analyses 
data on litters į from parental pairs j, with n — ne = 132 litters born to pairs exposed to 
EGMME and n, = 134 litters to control pairs not exposed. There are up to five litters 
per pair, and two outcomes for each litter, namely Y; binary and Yj Poisson, with 
Yj, = 1 if the birth was delayed (i.e. prolonged birth interval) and Yjj7 relating to litter 
size. Litter size has an effective maximum of 20. 
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The observed variables are linked to underlying Poisson variables Yj and 
Yip = (Vii, Yio, .... Yom), with M = 20. Let X;=1 for exposed pairs and X; = 0 
otherwise. Then the Poisson means are 


E( Yu) =exp(Bj; PX Gy vy) 
and for m= 1,.. M, where M = 20 is the maximum litter size 


E(Yğom) = exp (Bz; B4Xi Py Uy) 


The Bı; and B4; are intercepts specific to each of the five possible litters per pair; thus B4; 
is the intercept specific to the first litter (when 7 = 1), and so on. B, and By are exposure 
effects on times between births and on litter size. The correlations between outcomes are 
modelled via the common error ,;. The observed indicators are defined according to 
hits and no hits mechanisms, respectively, which is cumulated in the case of Yj. Thus 


Yi = (Yj = 1) 


M 
Yin = X òY fn = 0) 
m=1 
where 8(u) equals 1 if condition u holds, and zero otherwise. Note that Y3 = 0 repre- 
sents ‘no defect preventing successful birth’, so that cumulating over 8( Y3* = 0) gives the 
number of mice born. However, the actual complementary log-log model involves the 
chance Y3* > 1 of a defect at each m. 

Here data are simulated on 270 litters (five litters for ne = 27 control pairs, and five 
litters for ne = 27 exposed pairs) using the parameters supplied by Dunson (2000, p. 364). 
In re-estimating the model, N(0, 1) priors are assumed on the parameters B, and By, 
together with the informative priors of Dunson (2000) on B4; and B3; and the precisions of 
P1; Py and py. The complementary log-log link is used to reproduce the hits mechanism. 
A two chain run of 2000 iterations, with convergence after 250, gives estimates for B, and 
B4 parallel to those of Dunson (2000, p. 364); the posterior means and 95% credible 
intervals for these parameters are 1.92 (1.44, 2.40) and 0.32 (0.15, 0.48), respectively. 

The interpretation is that exposure to the EGMEE delays births and reduces litter 
sizes. There are several possible sensitivity analyses, including the extent of stability in 
parameter estimates under less informative priors on the precisions. 

Here an alternative model including explicit loadings on the constructs 9); and @; is 
also investigated. In this model, the two constructs have variance 1, but are allowed to 
be correlated (with parameter œw). On these assumptions, the loadings {\;,.., 4) in the 
following model may be identified: 


Yn = 8(VA> 1) 
M 

Yin = J 8(Yiim = 0) 
m=1 


EYR) =exp(By Mey W) = 1,5; 7=1, ne) 

EYR) = exp(Bii Bo Mey by) G@=1,5;fane 1,n) 

E( Yim) = exp (B; Mey W) = 1, 5; f= 1, n; m= 1,..M) 
E(Ybm) = exp (Bs, Ba Mey vy) G=1,5;f=1,n. 1,n;m=1,..M) 


M and dq are constrained to be positive for identifiability. 
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Table 8.11 Factor model for birth outcomes 


Mean St. devn. 2.5% 97.5% 
B2 1.92 0.23 1.48 2.38 
B4 0.27 0.08 0.12 0.43 
NI 0.00 0.31 —0.59 0.65 
A2 0.41 0.20 0.09 0.87 
A3 0.82 0.38 0.22 1.69 
Ng 1.19 0.52 0.36 2.50 


w 0.34 0.23 0.09 0.97 


A two chain run of 2000 iterations shows there is a positive correlation coefficient 
between the factors (albeit for these simulated data). There are also higher loadings ^2 
and \4 for the exposed group (as compared to N; and A3, respectively) on the parent 
level factors which represent chances of birth delay and birth defect, respectively. In real 
applications, this might represent excess risk beyond that represented by the simple 
dummy for exposure to EGMEE. 


8.5 LATENT STRUCTURE ANALYSIS FOR MISSING DATA 


In structural equation models, including confirmatory factor models, it may be that 
latent variables rather than (or as well as) observed indicators contribute to predicting 
or understanding missingness mechanisms. Sample selection or selective attrition that 
lead to missing data may be more clearly related to the constructs than to any combin- 
ation of the possible fallible proxies for such constructs. As above (Chapter 6), assume 
the full set of observed and missing indicator data is denoted Y = { Yobs, Ymis }where 
Ymis is of dimension M, and that the observed data includes an n x M matrix of binary 
indicators Rim corresponding to whether Yj is missing (Rim = 1) or observed (Rim = 0). 

Maximum likelihood and EM approaches to missing data in structural equation and 
factor analysis models are considered by Rovine (1994), Arbuckle (1996) and Allison 
(1987). As noted by Arbuckle (1996), the methods developed are often based on the 
missing at random assumption, or assume special patterns of missingess, such as 
the monotone pattern. Under monotone missingness, one might have completely ob- 
served variable X for all n subjects, a variable Y observed for only nı subjects, and a 
variable Z fully observed for only m subjects and observed only when Y is (the m 
subjects are then a subsample of the nı). Then the likelihood may be written 


N2 


Teo [rx oT [rain X 0 
i=l i=l i=1 


Under the MAR assumption, the distribution of R depends only upon the observed 
data, so 


JCR] Y, œ) = f (R| Yobs, ) 


whereas in many situations (e.g. attrition in panel studies) the attrition may depend upon 
the values of the indicators that would have been observed in later waves. An example of 
this non-random missingness is based on the Wheaton et al. (1977) study into alienation, 
considered in Example 8.11. Since the interest is in accounting for missingness via 
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latent constructs y based on the entire Y matrix, a non-random missingness model 
might take the form f(R|, œ). 


Example 8.10 Fathers’ occupation and education Allison (1987) considers a study 
by Bielby et al. (1977) which aimed to find the correlation between father’s occupational 
status and education for black men in the US. With a sample of 2020 black males, 
Bielby et al. found a correlation of 0.433, but realised this might be attenuated 
by measurement error. They therefore re-interviewed a random sub-sample of 348 
subjects approximately three weeks later, and obtained replicate measures on status 
and education. Let yı and y3 denote the first measures (on all 2020 subjects) relating 
to status and education, respectively. For the sub-sample, observations are also 
obtained on y2 and y4, repeat measures of status and education, respectively. For 
the 1672 subjects remaining of the original sample, these two variables are then 
missing. 


On this basis, one may assume, following Allison, that the missing data are missing 
completely at random — though sampling mechanisms may generate chance associations 
which invalidate the intention of the design. Since Rj and R;4 are either both 1 or both 
zero, a single response indicator R; = 1 for y2 and y4 present and R; = 0, otherwise may 
be adopted. 

For the 348 complete data subsample, the observed means are {16.62, 17.39, 6.65, 
6.75} and the variance-covariance matrix is 


180.9 126.8 24.0 22.9 
126.8 217.6 30.2 30.5 
24.0 30.2 16.2 14.4 
22.9 30.5 14.4 15.1 


while for the larger group the means on yı and y3 are 17 and 6.8, with variance- 
covariance matrix 


217.3 25.6 
25.6 16.2 


Bielby et al. and Allison assume that the data were generated by two underlying factors 
(‘true’ occupational and educational status) with 


Ves eae (8.10a) 
Ves, Doi, (8.10b) 
en eee (8.10c) 
Yai = Kq Ng; (8.10d) 


Both Bielby et al. and Allison take , and J, to have a free dispersion matrix which 
allows for covariation between the factors. This option means a constraint A; = M = 1 
is needed for identifiability. Alternatively, one might take y; and is, to be standardised 
variables with their dispersion matrix containing a single unknown correlation param- 
eter p. In this case, all the à parameters are identifiable. If in fact missingness is not 
MCAR, then it will be related either to the known observations Yı and Y3, or to the 
partially unknown observations Y and Y4, or to the factor scores, y; and y. 
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Table 8.12 Fathers’ occupation and education 


Mean St. devn. 2.5% Median 97.5% 


Coefficients in missingness model 


Bo 1.362 0.223 0.933 1.359 1.805 


By 0.005 0.009 —0.013 0.005 0.022 
B2 0.019 0.031 —0.041 0.019 0.079 


Correlation between factors 


p 0.608 0.047 0.511 0.609 0.696 


Free loadings 


A2 1.869 0.140 1.627 1.857 2.129 
N4 1.024 0.047 0.938 1.022 1.123 


The analysis of Allison is replicated here with an original sample size of 505, and later 
subsample of size 87. An open mind on the response mechanism is retained, and it is 
assumed that the probability of non-response m; = P(R; = 1) may be related (via a logit 
link) to Yı and Y; in line with MAR response. The response model is then 


logit(m;) = Bo Bi B2Y3: (8.11) 


and priors B; ~ N(0, 1) are assumed. The measurement model in Equation (8.10) is 
applied to all 592 subjects, regardless of observation status. 

Parameter summaries are based a single chain run taken to 120000 iterations (with 
5000 burn in) for estimates of {Bo, By, B2}, {Az2, ^4} and the correlation p between the 
two factors (and hence between true social and educational status). The diagnostics of 
Raftery and Lewis (1992) suggest this number is required because of a high autocorrela- 
tion in the samples, especially of M and ^4. The inter-factor correlation of 0.61 (Table 
8.12) compares to the estimate of 0.62 cited by Allison (1987, p. 86). There is in fact no 
evidence of departure from MCAR in the data as sampled, in the sense that 8, and B, in 
Model (8.11) are not different from zero (their 95% credible intervals straddle zero). The 
missingness model is though subject to possible revision, for example taking only Y3 as 
the predictor. Subject to identifiability, response models including { Y2, Y4} or {W1, >} 
may also be investigated. 


Example 8.11 Alienation over time This example considers adaptations of the data 
used in a structural equation model of alienation over time as described by Wheaton 
et al. (1977), originally with n = 932 subjects. In a reworked analysis of simulated data 
from this study reported by Muthen et al. (1987), there are six indicators of two 
constructs (social status and alienation) at time 1, and three indicators of alienation at 
time 2. A slightly smaller number of subjects (600) was assumed. 


Denote social status and alienation at time 1 by & and ,, and alienation at time 2 
by p. The original indicators for i= 1,..600 subjects are standardised, with three 
indicators X;—X3 at time 1 related to the social status (exogenous) construct as 
follows: 


X= Mič; u 
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Xai = Gi Mhi 

X3i = Mi; Ui 
where uj, u) and u3 are independently univariate Normal. The three indicators of 
alienation are denoted Yı, Y>} and Y3, at time 1 and Y2, Yo. and Y3 at time 2. 
They are related to the alienation construct at times 1 and 2 as follows: 

Yin =A; ai 

Your = ooh); Usi 

Yz; = Azo; Uo: 
and 

Yini = MỌ; Ui 

Yni = MỌ; Ugi 

Y32i = Agatha; Uoi 


The constructs themselves are related first by a cross-sectional model at time 1, namely 


bi = Bu Wi (8.12a) 
and by a longitudinal model relating time 2 to time 1, namely 
Wo; = Bai; Bopi wz (8.12b) 


Thus, alienation at time 2 depends upon alienation at time 1 and status at time 1. 

Muthen et al. use various models to simulate missingness, which is confined to the 
wave 2 indicators of alienation, and applies to all items for non-responding subjects. 
Thus, let R; be a binary indicator of whether a subject is missing at wave 2 (i.e. unit 
rather than item non-response) with R; = 1 for response present and R; = 0 for response 
missing (this coding for R is used to be consistent with Muthen et al.). Underlying this 
binary indicator is a latent continuous variable R¥, which is zero if R* < 7. 

One model for the R¥ assumes they are related only to fully observed (i.e. first wave) 
data X;; and Yki: 


Rt = 0.6670*(X1; Xj X3;) 

—0.333w*(Yini Yai Yz) ð; (8.13) 
where 6; ~ N(0, 1) and w* = 0.329. The cut off t is taken as —0.675. This is missingness 
at random (only depending on observed data), and leads to a missingness rate of around 
25%. 

Another choice makes missingness depend upon both the wave | and 2 outcomes 
whether observed or not, so that the missingness mechanism is non-ignorable. Thus, 
Rt = 0.6670*(X1; Xai X3i) 
— 0.3330*(Yi You Yai Yiu Yni Ysu) ð; (8.14) 
In this case, Muthen et al. varied the degree of selectivity by setting w* at 0.33, 0.27 or 
0.19. 


A further option, also non-ignorable, makes missingness depend upon the latent 
factors so that 


R¥ = 0.667w*E; — 0.3330*(s,, Y) 8; (8.15) 
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with w* = 0.619. This is non-ignorable, because , is defined both by observed and 
missing data at phase 2. 

Accordingly, values of (X;, j = 1, 3) { Yj, j = 1, 3} and { Yp, j = 1, 3} are generated 
and sampled data at wave 2 then removed according to the missingness model. One may 
then compare (a) the estimates of the parameters {A, B, var(w;), var(um)} using the 
original data with no imputed non-response (b) the parameters obtained when adopting 
a missingness model based only on a MAR mechanism, and (c) the parameters obtained 
adopting a missingness model based on the latent factors, for example as in Equation 
(8.15). Under (b), logit models for R; or R¥ depend upon the fully observed observations 
at wave 1, and under (c) such models depending on the constructs W, Y, and &. 

Using the 9 x 9 correlation matrix provided by Muthen et al., a full data set may be 
generated and missingness then imputed according to Equation (8.13), (8.14) or (8.15). 
We adopt the option in Equation (8.14), where missingness is related to all indicators, 
whether subject to non-response at wave 2 or not, and take w* = 0.27. To generate the 
data, it is necessary to sample all the {X;;, Yj} and then ‘remove’ the sampled data for 
missing cases where Rj is under the threshold. The form of the missingness models 
(8.13)-(8.15) means there is either complete non-response at wave 2 or complete 
response at unit level on all three indices. So individual item response indices Rj may 
be replaced by a single unit response index, R; = 0 for all missing observations at wave 
2, and R; = 1 otherwise. There are 169 of the 600 observations with missingness at 
wave 3, a rate of 28%. 

Under the response mechanism in Equation (8.14), attrition is greater for lower status 
persons and more alienated persons: so missingess might be expected to be greater for 
subjects with higher scores on , and y2. In Model B in Program 8.11, the response 
model relates 7; = Pr(R; = 1) to the factor scores, namely, 


logit(m;) = wo œi; wp; 3, (8.16) 


We then obtain the expected negative impacts on response of alienation (is, and p) anda 
positive impact of status, ë (see Table 8.13 obtained from iterations 500-5000 ofa two chain 
run). The impact of is, is as might be expected, less precisely estimated than that of y; Other 
models relating m; to (say) just € and W might be tried. The coefficients of the structural 
model (8.12) are close to the parameters obtained from the fully observed sample, though 
the negatively signed impact B»; of social status € on alienation y, at time 2 is enhanced. 
Instead one might assume a model with no information to predict missingess, i.e. 


logit(a7;) = ao (8.17) 


(This is pi.1[] in Model B in Program 8.11.) In the present case, and with the particular 
sample of data from the covariance matrix of Muthen et al., this produces very similar 
estimates of structural and measurement coefficients to the non-ignorable model. 

Both models in turn provide similar estimates of the parameters to those based on the 
fully observed data set of 600 x 9 variables (Model C in Program 8.11). Model (8.16) 
allowing for non-ignorable missingness provides an estimate for \23 closer to the full 
data parameter, but B»; is better estimated under the MCAR model (8.17). So for this 
particular sampled data set, there is no benefit in using a missingness model linked to 
values on the latent constructs. However, to draw firm conclusions about the benefits of 
ignorable vs. non-ignorable missingess it would be necessary to repeat this analysis with 
a large number of replicate data sets. 
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Table 8.13 Alienation study missing data, parameter summary 


Mean St. devn. 2.50% Median 97.50% 


Missingness coefficients 


1 1.13 0.12 0.91 1.12 1.37 


w2 0.70 0.26 0.19 0.70 1.21 
03 —0.61 0.32 —1.24 —0.61 0.03 
w4 —0.50 0.47 —1.48 —0.48 0.42 


Structural model coefficients 


Bi —0.57 0.06 —0.69  —0.57  —0.46 
Ba —0.31 0.07 —0.46  —0.31  —0.17 
Boy 0.54 0.07 0.40 0.54 0.69 


Measurement model coefficients 


Ai 1.00 
A22 1.06 0.07 0.93 1.05 1.20 
N31 0.71 0.06 0.59 0.70 0.82 
A12 1.00 
A22 1.00 0.06 0.88 0.99 1.11 
N32 0.72 0.05 0.62 0.72 0.83 
N13 1.00 
N23 0.89 0.09 0.73 0.89 1.08 


N33 0.56 0.08 0.41 0.56 0.72 


8.6 REVIEW 


Despite some long-standing Bayesian discussion of certain aspects of factor analysis 
and structural equation modelling (e.g. Lee, 1981; Press and Shigemasu, 1989), recent 
MCMC applications have occurred at a relatively low rate compared to other areas. 
This may in part reflect the availability of quality software adopting a maximum 
likelihood solution. Some of the possible advantages of Bayesian analysis are suggested 
by Scheines et al. (1999) and Lee (1992) in terms of modifying formal deterministic 
constraints to allow for stochastic uncertainty. Recent developments introducing struc- 
tural equation concepts into multi-level analysis are discussed by Jedidi and Ansari 
(2001), who include an application of the Monte Carlo estimate (Equation (2.13b) in 
Chapter 2) of the CPO to derive pseudo Bayes factors. 

However, considerable issues in the application of repeated sampling estimation 
remain relatively unexplored: whereas label switching in discrete mixture regression 
is well documented (see Chapters 2 and 3), the same phenomenon occurs in models 
of continuous latent traits. Similarly in latent class models with two or more latent 
class variables (e.g. latent class panel models as in Section 8.4.2) there are complex 
questions around consistent labelling of all such variables and how far constraints might 
restrict the solution. Bayesian SEM applications with discrete data are also relatively 
few. 
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EXERCISES 


1. In Example 8.2, try the more usual Normal density assumption for the indicators 
(equivalent to Wim = 1 by default for all 7) and assess fit against the Student ¢ model 
using the Gelfand-Ghosh or DIC criterion. The latter involves the likelihood com- 
bined over all indicators Xm. 


2. In Example 8.3, try estimating both models with a non-zero intercept yọ in the 
structural model (with an N(0, 1) prior, say). Does this affect the relative perform- 
ance of the two models? 


3. In Example 8.4, add second order interaction parameters between the items and LC 
in the log-linear model for the G;,;, and compare fit with the model confined to first 
order interactions. 


4. In Example 8.5, try a single factor model and evaluate its fit against the two factor 
model. 


5. In the Ante-natal Knowledge example, try an analysis without preliminary dichot- 
omisation but regrouping the data into ordinal scales. 


6. In Example 8.8, try alternative ways to possibly improve identifiability, namely (a) 
assuming that level 1 and 2 loadings are the same, and (b) that only one factor is 
relevant at level 2. Set up the Normal likelihood calculation and assess changes in 
DIC or a predictive criterion. 


7. In Example 8.10, repeat the analysis with only Y3 included in the missingness model. 
Assess the change in DIC as compared to the model used in the Example, where the 
relevant deviances are for both response indicators and observed outcomes. 
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CHAPTER 9 


Survival and Event History 
Models 


9.1 INTRODUCTION 


Processes in the lifecycle of individuals including marriage and family formation, 
changes in health status, changes in job or residence may be represented as event 
histories. These record the timing of changes of state, and associated durations of 
stay, in series of events such as marriage and divorce, job quits and promotions. 
Many applications of event history models are to non-repeatable events such as mor- 
tality, and this type of application is often called survival analysis. Survival and event 
history models have grown in importance in clinical applications (e.g. in clinical trials), 
in terms of survival after alternative treatments, and in studies of times to disease 
recurrence and remission, or response times to stimuli. 

For non-renewable events the stochastic variable is the time from entry into observa- 
tion until the event in question. So for human survival, observation commences at birth 
and the survival duration is defined by age at death. For renewable events, the depend- 
ent variable is the duration between the previous event and the following event. We may 
be interested in differences either in the rate at which the event occurs (the hazard rate), 
or in average inter-event times. Such heterogeneity in outcome rate or inter-event 
durations may be between population sub-groups, between individuals as defined by 
combinations of covariates, or as in medical intervention studies, by different therapies. 
Thus, in a clinical trial we might be interested in differences in patient survival or relapse 
times according to treatment. 

Whereas parametric representations of duration of stay effects predominated in early 
applications, the current emphasis includes semiparametric models, where the shape of 
the hazard function is essentially left unspecified. These include the Cox proportional 
hazards model (Cox, 1972) and recent extensions within a Bayesian perspective such as 
gamma process priors either on the integrated hazard or hazard itself (Kalbflesich, 
1978; Clayton, 1991; Chen et al., 2000). While the shape of the hazard function in time is 
often of secondary interest, characteristics of this shape may have substantive implica- 
tions (Gordon and Molho, 1995). 

Among the major problems that occur in survival and inter-event time modelling is a 
form of data missingness known as ‘censoring’. A duration is censored if a respondent 
withdraws from a study for reasons other than the terminating event, or if a subject does 
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not undergo the event before the end of the observation period. Thus, we know only 
that they have yet to undergo the event at the time observation ceases. This is known as 
‘right censoring’, in that the observed incomplete duration is necessarily less than the 
unknown full duration until the event. Other types of censoring, not considered in the 
examples below, are left censoring and interval censoring. In the first, subjects are 
known to have undergone the event but the time at which it occurred is unknown, 
while in the second it is known only that an event occurred within an interval, not the 
exact time within the interval. 

Another complication arises through unobserved variations in the propensity to 
experience the event between individual subjects, population groups, or clusters of 
subjects. These are known as ‘frailty’ in medical and mortality applications (Lewis 
and Raftery, 1995). If repeated durations are observed on an individual, such as 
durations of stay in a series of jobs, or multiple event times for patients (Sinha and 
Dey, 1997), then the cluster is the individual employee or patient. The unobserved 
heterogeneity is then analogous to the constant subject effect in a panel model. Given 
the nature of the dependent variable, namely the length of time until an event occurs, 
unmeasured differences lead to a selection effect. For non-renewable events such as 
human mortality, high risk individuals die early and the remainder will tend to have 
lower risk. This will mean the hazard rate will rise less rapidly than it should. 

A third major complication occurs in the presence of time varying covariates and here 
some recent approaches to survival models including counting processes (Andersen et 
al., 1993; Fleming and Harrington, 1991) are relatively flexible in incorporating such 
effects. In event history applications, counting processes also allow one to model the 
effect of previous moves or durations in a subject’s history (Lindsey, 2001). 

Survival model assessment from a Bayesian perspective has been considered by 
Ibrahim et al. (2001a, 2001b) and Sahu et al. (1997), who consider predictive loss criteria 
based on sampling new data; and by Volinsky and Raftery (2000), who consider the 
appropriate form of the Bayesian Information Criterion (BIC). Pseudo-Bayes factors 
may also be obtained via harmonic mean estimates of the CPO (Sahu et al., 1997; Kuo 
and Peng, 2000, p. 261) based on the full data. 

Volinsky and Raftery suggest that the multiplier for the number of parameters be not 
log(n) but log(d), where n and d are, respectively, the total subjects and the observed 
number of uncensored subjects. Then if £ is the log-likelihood at the maximum likeli- 
hood solution and p the number of parameters 


BIC = —2¢  plog(d) 


Another version of this criterion, namely the Schwarz Bayesian Criterion (SBC), is 
proposed by Klugman (1992). This includes the value of the prior (8) at the posterior 
mean 6, and log-likelihood (8), so that 


SBC = (6) (6) — plog (n/a) 


where the last term on the right-hand side involves 7 = 3.1416. The AIC, BIC and SBC 
rely on knowing the number of parameters in different models, but the often high level of 
missing data through censoring means, for instance, that the true number of parameters is 
unknown and the method of Spiegelhalter et a/. (2001) might therefore be relevant. 
Predictive loss methods may be illustrated by an adaptation of the Gelfand and 
Ghosh (1998) approach; thus, let ¢; be the observed times, uncensored and censored, 0 
the parameters, and z; the ‘new’ data sampled from f(z|0). Suppose v; and c; are the 
mean and variance of z;, then, following Sahu et al. (1997), one criterion for any w > 0 is 
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D=X g [w/w DIS m u? (9.1) 
i=1 i=1 


where u; = max (v;, si) if s; is a censored time and u; = t; if the time is uncensored. 


9.2 CONTINUOUS TIME FUNCTIONS FOR SURVIVAL 


Suppose event or survival times T are recorded in continuous time. Then the density f (t) 
of these times defines the probability that an event occurs in the interval (t,t dt), 
namely 


f= lim Pr(t<T<t dt)/dt 
t— 


with cumulative density 


F(t) = [roa 


From this density the information contained in duration times can be represented in two 
different ways. The first involves the chance of surviving until at least time ¢ (or not 
undergoing the event before duration ft), namely 


S(t) = Pr(T >) =1-F() 
= [fendi 


The other way of representing the information involves the hazard rate, measuring the 
intensity of the event as a function of time, 


A(t) = f(0)/S 


and in probability terms, the chance of an event in the interval (t, £ dt) given survival 
until ¢t. From A(t) is obtained the cumulative hazard H(t) = i h(u)du, and one may also 
write the survivor function as S(t) = exp ( — A(d). 

As an example of a parameterised form of time dependence, we may consider the 
Weibull distribution for durations W(A, y), where à and y are scale and shape param- 
eters, respectively (Kim and Ibrahim, 2000). The Weibull hazard is defined as 


h(t) = dt"! 
with survival function 
S(t) = exp(— df”) 
and density 
S(t) = dy exp(— Af) 


The Weibull hazard is monotonically increasing or decreasing in time according to 
whether y > 1 or y < 1. The value y = 1 leads to exponentially distributed durations 
with parameter ÀN. 

To introduce stationary covariates x of dimension p, we may adopt a proportional 
form for their impact on the hazard. Then the Weibull hazard function in relation to 
time and the covariates is 


364 SURVIVAL AND EVENT HISTORY MODELS 
h(t, x) = de® yt! (9.2) 


Under proportional hazards, the ratio of the hazard rate at a given time ¢ for two 
individuals with different covariate profiles, x; and x2 say, is 


h(t, x1)/A(t, x2) = exp (B(xı — X2)) 


which is independent of time. 

An equivalent form for the Weibull proportional hazards model in Equation (9.2) 
(Collett, 1994) involves a log-linear model for the durations ¢; and assumes a specified 
error u;, namely the extreme value (Gumbel) distribution. Then 


log(t))=v ax; ou; (9.3) 
where, in terms of the parameters in Equation (9.2), the scale is 
o =1/y 
the intercept is 
v = —o log (A) 
and the covariate effects are 
aj=—-Bjo j=l,...,p 


Taking u; as standard Normal leads to a log-Normal model for durations, while taking 
u; as logistic leads to the log-logistic model for t (Lawless, 1982; Fahrmeir and Tutz, 
2001). 

In BUGS the Weibull density for durations is routinely implemented as 

tli] ~ dweib(lambda[i],gamma), 

where the log of lambda[i] (or possibly some other link) is expressed as a function of an 
intercept and covariates, and gamma is the shape parameter. While the Weibull hazard 
is monotonic with regard to duration ft, a non-monotonic alternative such as the log- 
logistic may be advantageous, and this may be achieved in BUGS by taking a logistic 
model for y = log (t). Here, t are observed durations, censored or complete. 

Thus, 


yi ~ Logistic(w,, K) (9.4) 


where k is a scale parameter and p; is the location of the ith subject. The location may be 
parameterised in terms of covariate impacts w; = Bx; on the mean length of log survival 
(rather than the hazard rate). The variance of y is obtained as m/(3k7). The survivor 
function in the y scale is 


S(y)=[1 exp({y- p}/o)T! (9.5) 
where o = 1/k. In the original scale, the survivor function is 
SA =p 4/07 T! (9.6) 


where 0 = e". 


Example 9.1 Reaction times An example of a parametric analysis of uncensored 
data is presented by Gelman et al. (1995, Chapter 16), and relates to response times 
on i=1,.. 30 occasions for a set of j = 1, .. 17 subjects; 11 were not schizophrenic 
and six were diagnosed as schizophrenic. In Program 9.1, the first 11 cases are 
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non-schizophrenic. As well as response times being higher for the latter, there is 
evidence of greater variability in reaction times for the schizophrenics. 


For the non-schizophrenic group a Normal density for the log response times 
Vi = log, ty (i.e. a log-Normal density for response times) is proposed, with distinct 
means for each of the 11 subjects. We might alternatively adopt a heavier tailed density 
than the Normal for the schizophrenic group, but there are substantive grounds to 
expect distinct sub-types. Specifically, delayed reaction times for schizophrenics may be 
due to a general motor retardation common to all diagnosed patients, but attentional 
deficit may cause an additional delay on some occasions for some or all schizophrenics. 

The observed times for non-schizophrenics are modelled as 


with the means for subjects j drawn from a second stage prior a; ~ N(u, ®). For the 
schizophrenics, the observed times are modelled as 


vi ~ Nla; =TGy, œ) 


ajo N B,D) 


where B and 7 are expected to be positive. The Gj are a latent binary classification of 
schizophrenic times, according to whether the Additional Attention Deficit (AD) 
impact was operative or not. 

We accordingly assign N(0, 1) priors for B and for t, measuring the AD effect, with 
sampling confined to positive values. For the probabilities \; and M of belonging to the 
AD group or not (among the schizophrenic patients) a Dirichlet prior is adopted, with 
weights of 1 on the two choices. Gelman et al. use the equivalent parameterisation 
ht = 1 —Xand M = à with the group indicators drawn from a Bernoulli with parameter 
à. Note that the constraint t > 0 is already a precaution against ‘label switching’ in this 
discrete mixture problem. Convergence of parameters (over a three chain run to 20000 
iterations) is achieved by around iteration 8000 in terms of scale reduction factors 
between 0.95 and 1.05 on the unknowns, and summaries based on the subsequent 
12 000. 

We find an estimated median M of 0.12 (Table 9.1, Model A), which corresponds to 
that obtained by Gelman et al. (1995, Table 16.1). The excess of the average log 
response time for the non-delayed schizophrenic times over the same average for non- 
schizophrenics is estimated at B = 0.32, as also obtained by Gelman et al. 

We follow Gelman et al. in then introducing a distinct variance parameter for those 
subject to attentional deficit, and also an additional indicator F; ~ Bern(w) for schizo- 
phrenic subjects such that Gj can only be 1 when F; = 1. M is now the chance of an AD 
episode given that the subject is AD prone. 

The second half of a three chain run of 20000 iterations leads to posterior means 
M = 0.66, and œ = 0.50 (Table 9.1, Model B). However, using the predictive loss 
criterion of Gelfand and Ghosh (1998) and Sahu et al. (1997), it appears that the 
more heavily parameterised model has a worse loss measure as in Equation (9.1), and 
so the simpler model is preferred. This conclusion is invariant to values of w between 
1 and values of w so large that w/(1 w) is effectively 1. 


Example 9.2 Motorettes Tanner (1996) reports on the analysis of repeated observa- 
tions of failure times of ten motorettes tested at four temperatures. All observations are 
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Table 9.1 Response time models, parameter summary 


Model A Mean St. devn. 2.5% Median 97.5% 


Ny 0.877 0.029 0.814 0.879 0.929 
A2 0.123 0.029 0.071 0.121 0.186 
B 0.317 0.08 0.16 0.317 0.477 
u 5.72 0.05 5.63 5.72 5.81 
T 0.843 0.06 0.729 0.842 0.962 
Model B 

Ai 0.344 0.160 0.054 0.336 0.674 
ho 0.656 0.160 0.326 0.664 0.946 
B 0.261 0.100 0.082 0.253 0.472 
u 5.72 0.04 5.63 5.72 5.81 
T 0.552 0.245 0.228 0.468 1.099 
w 0.500 0.236 0.068 0.527 0.888 


Table 9.2 Failure times of motorettes (* censored) 


Temperature (centigrade) 


Motorette 150 170 190 220 
1 8064* 1764 408 408 
2 8064* 2772 408 408 
3 8064* 3444 1344 504 
4 8064* 3542 1344 504 
5 8064* 3780 1440 504 
6 8064* 4860 1680* 528* 
7 8064* 5196 1680* 528* 
8 8064* 5448* 1680* 528* 
9 8064* 5448* 1680* 528* 

10 8064* 5448* 1680* 528* 


right censored at the lowest temperature, and three motorettes are censored at all 
temperatures (Table 9.2). 

The original times ¢ are transformed via W = logy, (t), and a Normal density pro- 
posed for them with variance o? and means modelled as 


w; =B Boi 


where V; = 1000/(temperature 273.2). For censored times it is necessary to constrain 
sampling of possible values above the censored time; it is known only that the actual value 
must exceed the censored time. For uncensored cases, we follow the BUGS convention in 
including dummy zero values of the censoring time vector (W.cen[] in Program 9.2). 
Tanner obtains o = 0.26, B; = —6.02 and B, = 4.31. We try both linear and quad- 
ratic models in V; and base model selection on the Schwarz criterion at the posterior 
mean. The posterior means on the censored failure times come into the calculations of 
the SBC’s (Models Al and B1 in Program 9.2). Less formal assessments might involve 
comparing (between linear and quadratic models) the average deviance or likelihood 
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over iterations subsequent to convergence. It is important to centre the V; (especially in 
the quadratic model) to guarantee early convergence of the B;, which means that the 
intercept will differ from Tanner’s. 

Summaries in Table 9.3 are based on the last 4000 iterations of three chain runs to 
5000 iterations. With N(0, 1000) priors on the B, and G(1, 0.001) prior on 1/ o°, there is 
a slight gain in simple fit, as measured by the average likelihood, with the quadratic 
model. However, the SBC suggests that the extra parameter is of doubtful value, with 
the simpler model preferred. As an illustration of the predictions of the complete failure 
times for observations on incomplete or censored times, the times censored at 8064 for 
temperature 150°C are predicted to complete at 16970 (median), with 95% interval 
(8470, 73 270) under the linear model. These appear as log10(t.comp) in Table 9.3. 


Example 9.3 Log-logistic model As an example of log-logistic survival, we apply the 
logistic model (9.4)-(9.5) to the logs of the leukaemia remission times from the Gehan 
(1965) study. A G(1, 0.001) prior on k = 1/o is taken in Equation (9.5) and flat priors 
on the treatment effect, which is the only covariate. Initially, k is taken the same across 
all subjects (Model A). 


A three chain run to 10000 iterations (with 1000 burn-in) leads to an estimated mean 
treatment difference of 1.31 (i.e. longer remission times for patients on the treatment) 
and a median for ø = 1/k of 0.58. These values compare closely with those obtained by 
(Aitkin et al., 1989, p. 297). Treatment and placebo group survival curves as in Equation 
(9.6) show the clear benefit of the treatment in terms of mean posterior probabilities up 
to t = 50 (Table 9.4). 

In a second model (Model B), the scale parameter k is allowed to differ between the 
treatment and placebo groups. The predictive loss criterion in Equation (9.1) suggests 
the simpler model to be preferable to this extension; the same conclusion follows from 
the pseudo Bayes factor based on Monte Carlo estimates of the CPO (Sahu et al., 1997). 
It is, however, noteworthy that B; is enhanced in Model B, and that variability o 
appears greater in the treatment group. 


Table 9.3 Motorette analysis, parameter summary 


Quadratic Mean St. devn. 2.5% Median 97.5% 


SBC at posterior mean —22.1 

Log-likelihood —3.3 6.2 —17.0 —2.7 7.1 
Bı 3.35 0.07 3.21 3.34 3.50 
B2 5.43 0.85 4.00 5.34 7.32 
B3 14.01 6.33 2.91 13.52 27.75 
(o 0.26 0.05 0.18 0.25 0.38 
log10(t.comp) 4.63 0.38 4.01 4.58 5.48 
Linear 


SBC at posterior mean —19 


Log-likelihood —4.6 6.0 —18.1 —4.0 5.5 

By 3.48 0.06 3.38 3.48 3.62 
B2 4.34 0.46 3.49 4.32 5.31 
(o 0.27 0.05 0.19 0.26 0.39 


log10(t.com 4.26 0.24 3.93 4.23 4.82 
8 p 
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Table 9.4 Treatment effect on remission times (Model A) 


Parameter Mean St. devn. 2.5% Median 97.5% 
Bo 1.886 0.221 1.436 1.890 2.312 
Bı (Treatment effect on 1.312 0.354 0.652 1.296 2.046 
remission time) 
o 0.582 0.093 0.427 0.573 0.790 
Group survival curves 
Placebo Treated 
Time Mean St. devn. Mean St. devn. 
1 0.958 0.026 0.995 0.005 
2 0.882 0.053 0.985 0.011 
3 0.791 0.072 0.971 0.019 
4 0.701 0.083 0.954 0.026 
5 0.616 0.090 0.934 0.034 
6 0.541 0.092 0.913 0.041 
7 0.475 0.091 0.890 0.048 
8 0.419 0.089 0.866 0.055 
9 0.371 0.087 0.842 0.062 
10 0.331 0.083 0.817 0.068 
11 0.296 0.080 0.791 0.073 
12 0.266 0.076 0.766 0.079 
13 0.240 0.073 0.741 0.083 
14 0.218 0.070 0.716 0.088 
15 0.199 0.067 0.692 0.092 
16 0.182 0.064 0.668 0.095 
17 0.168 0.061 0.645 0.098 
18 0.155 0.058 0.623 0.101 
19 0.143 0.056 0.601 0.103 
20 0.133 0.053 0.580 0.105 
21 0.124 0.051 0.560 0.106 
22 0.116 0.049 0.541 0.108 
23 0.109 0.047 0.522 0.109 
24 0.102 0.045 0.504 0.109 
25 0.096 0.043 0.487 0.110 
26 0.090 0.042 0.471 0.110 
27 0.085 0.040 0.455 0.111 
28 0.081 0.039 0.440 0.111 
29 0.077 0.038 0.426 0.111 
30 0.073 0.036 0.412 0.110 
31 0.069 0.035 0.399 0.110 
32 0.066 0.034 0.386 0.109 
33 0.063 0.033 0.374 0.109 
34 0.060 0.032 0.363 0.108 
35 0.057 0.031 0.352 0.108 
36 0.055 0.030 0.341 0.107 
37 0.053 0.029 0.331 0.106 


(continues) 
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Table 9.4 (continued) 


Placebo Treated 
Time Mean St. devn. Mean St. devn. 
38 0.051 0.028 0.321 0.105 
39 0.049 0.028 0.312 0.104 
40 0.047 0.027 0.303 0.103 
41 0.045 0.026 0.295 0.102 
42 0.043 0.025 0.286 0.101 
43 0.042 0.025 0.279 0.100 
44 0.040 0.024 0.271 0.099 
45 0.039 0.024 0.264 0.098 
46 0.038 0.023 0.257 0.097 
47 0.036 0.022 0.250 0.096 
48 0.035 0.022 0.244 0.095 
49 0.034 0.021 0.238 0.094 


50 0.033 0.021 0.232 0.093 


Example 9.4 Nursing home length of stay Morris et al. (1994) consider length of stay 
for a set of 1601 nursing home patients in terms of a treatment and other attributes (age, 
health status, marital status, sex) which might affect length of stay. Stay is terminated 
either by death or return home. We here estimate linear covariate effects in a propor- 
tional Weibull hazard 


h(t, z) = yt’! exp (Bx) 


This is equivalent to a regression of the logged length of stay on the regressors with a 
scaled error term u 


log()}= ox ou 


where o = 1/y and ¢ = —Bo. We obtain results on the treatment and attribute vari- 
ables similar to those of Morris et al. These are the coefficients @ on the predictors of log 
length of stay, which is a close proxy for length of survival in the context. All covariates 
are categorical except age, which is converted to a spline form 


age.s = min(90, age)—65. 


Health status is based on numbers of activities of daily living (e.g. dressing, eating) 
where there is dependency in terms of assistance being required. Thus, health=2 if there 
are four or less ADLs with dependence, health=3 for five ADL dependencies, health=4 
for six ADL dependencies, and health=5 if there were special medical conditions 
requiring extra care (e.g. tube feeding). 

Convergence with a three chain run is achieved early and the summary in Table 9.5 is 
from iterations 500-2000. Personal attributes such as gender, health status, age and 
marital status all impact on length of stay. Married persons, younger persons and males 
have shorter lengths of stay, though the effect of age straddles zero. Married persons, 
often with a care-giver at home, tend to enter with poorer initial functional status, 
associated with earlier death. The experimental treatment applied in some nursing 
homes involved financial incentives to improve health status and (for the non-terminal 
patients) achieve discharge within 90 days; however, the effect is not towards lower 
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Table 9.5 Nursing home stays, parameter summary 


Mean St. devn. 2.5% Median 97.5% 
Full model 
Intercept 5.739 0.165 5.437 5.735 6.070 
Age 0.007 0.007 —0.007 0.007 0.020 
Treatment 0.201 0.093 0.017 0.203 0.381 
Male —0.562 0.108 —0.772 —0.562 —0.342 
Married —0.262 0.128 —0.516 —0.259 —0.006 
Health 3 0.045 0.131 —0.215 0.045 0.301 
Health 4 —0.377 0.131 —0.636 —0.378 —0.117 
Health 5 —0.872 0.166 —1.201 —0.873 —0.550 
Scale 1.635 0.036 1.564 1.635 1.706 
Reduced model 
Intercept 5.766 0.149 5.475 5.763 6.066 
Age 0.007 0.007 —0.007 0.007 0.020 
Treatment 0.198 0.089 0.022 0.197 0.373 
Male —0.565 0.112 —0.791 —0.566 —0.342 
Married —0.257 0.127 —0.499 —0.258 0.001 
Health 4 —0.400 0.101 —0.594 —0.399 —0.205 
Health 5 —0.897 0.149 —1.182 —0.897 —0.596 


Scale 1.635 0.037 1.557 1.636 1.707 


length of stay, possibly because patients in treatment homes were more likely to be 
Medicaid recipients (Morris et al., 1994). 

It would appear that the effect of health status level 3 is not clearly different from zero 
(i.e. from the null parameter of the reference health status), and so the groups 2 and 3 
might be amalgamated. We therefore fit such a model (Model B in Program 9.4), and 
find its pseudo-marginal likelihood to be in fact higher than the model (—8959 vs. 
—9012) involving the full health status scale. The conventional log likelihood averages 
around —8548 for both models. 


9.3 ACCELERATED HAZARDS 


In an Accelerated Failure Time (AFT) model the explanatory variates act multiplica- 
tively on time, and so affect the ‘rate of passage’ to the event; for example, in a clinical 
example, they might influence the speed of progression of a disease. Suppose 


vi = Bixi Box +. By Xpi (9.7a) 


denotes a linear function of risk factors (without a constant). Then the AFT hazard 
function is 


A(t, x) = e”'ho(e”'t) 
For example, if there is Weibull time dependence, the baseline hazard is 


ho(t) = ryt! 
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and under an AFT model, this becomes 


h(t, x) = i ry(te i! 


9.7b 
= (e")* ryt! ee 


Hence the durations under an accelerated Weibull model have a density 
We, y) 

whereas under proportional hazards the density is 
We", y) 


If there is a single dummy covariate (e.g. x; = 1 for treatment group, 0 otherwise), then 
v; = Bx; = B when x; = 1. Setting = e®, the hazard for a treated patient is 


bho(t) 


and the survivor function is So(t). The multiplier is often termed the acceleration 
factor. 
The median survival time under a Weibull AFT model is 


1.50 = [log 2/ {Xe }]/y (9.8) 


In an example of a Bayesian perspective, Bedrick et al. (2000) consider priors for the 
regression parameters in Equation (9.7a) expressed in terms of their impact on median 
survival times in Equation (9.8) rather than as direct priors on the B,. 


Example 9.5 Breast cancer survival We consider the breast cancer survival times (in 
weeks) of 45 women, as presented by Collett (1994, p. 7). The risk factor is a classifica- 
tion of the tumour as positively or negatively stained in terms of a biochemical marker 
HPA, with x; = 1 for positive staining, and x; = 0 otherwise. We use a G(1, 0.001) prior 
on the Weibull shape parameter. 


A three chain run of 5000 iterations shows early convergence of B and convergence at 
around iteration 750 for y. The summary, based on the last 4000 iterations, shows the 
posterior mean of y to be 0.935, but with the 95% interval straddling unity (Table 9.6). 
The posterior mean of the positive staining parameter B is estimated as around 1.1, and 
shows a clear early mortality effect for such staining. The CPOs show the lowest 
probability under the model for cases 8 and 9, where survival is relatively extended 
despite positive staining. The lowest scaled CPO (the original CPOs are scaled relative 
to their maximum) is 0.016 (Weiss, 1994). 


Table 9.6 Breast cancer survival, parameter estimates 


Mean St. devn. 2.5% Median 97.5% 
B 1.105 0.566 0.098 1.075 2.288 
y (Weibull shape) 0.935 0.154 0.651 0.931 1.266 
t.50 92.0 23.8 54.7 89.1 147.8 
Hazard ratio under 3.14 1.75 1.09 2.73 7.50 


proportional hazards 
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The posterior mean of the analytic median survival formula (9.8) for women with 
cancer classed as positively stained is around 92 weeks, a third of the survival time of 
women with negatively stained tumours. Under the proportional hazards model the 
hazard ratio would be eY® which has a median of 2.7 similar to that cited by Collett 
(1994, p. 214), though is not precisely estimated both because of the small sample and 
because it involves a product of parameters. 


9.4 DISCRETE TIME APPROXIMATIONS 


Although events may actually occur in continuous time, event histories only record time 
in discrete units, generally called periods or intervals, during which an event may only 
occur once. The discrete time framework includes population life tables, clinical life 
table methods such as the Kaplan—Meier method, and discrete time survival regressions. 
Applications of the latter include times to degree attainment (Singer and Willett, 1993), 
and the chance of exit from unemployment (Fahrmeir and Knorr-Held, 1997). The 
discrete framework has been adapted to semi-parametric Bayesian models (Ibrahim et 
al., 2001a) as considered below. 
Consider a discrete partition of the positive real line, 


0<a <a <.. <a, <% 


and let 4; denote the interval [a;—1, aj), with the first interval being [0, a,). The discrete 
distributions analogous to those above are 


fj = Pr(TeA;) (9.9) 
=$- Sj 1 
where, following Aitkin et al. (1989), 
S; = Pr(T > aj-1) (9.10) 
=F Spt aa SJE 


The survivor function at aj_; is S; and at a; is S; ı with the first survivor rate being 
Sı = 1. The jth discrete interval hazard rate is then 


h; = Pr(TeA;|T > a1) = f;/S; 
It follows that 
hy = (Sj — S; 1)/S; 
and so 
S; 1/451- h; 


So the chance of surviving through r successive intervals, which is algebraically 
E 
S-1=[[S) 1/5; 
j=l 
can be estimated as a ‘product limit’ 


Sy 1= Jc — hy) 
jal 
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The likelihood is defined over individuals 7 and periods j and a censoring variable w; is 
coded for the end point a; of each interval (a;_1, aj] and each subject, up until the final 
possible interval (az, oo]. Suppose we have a non-repeatable event. Then if the observa- 
tion on a subject ends with an event within the interval (a;_1, a;], the censoring variable 
would be coded 0 for preceding periods, while w; = 1. A subject still censored at the end 
of the study would have indicators w; = 0 throughout. 

In aggregate terms, the likelihood then becomes a product of L binomial probabil- 
ities, with the number at risk at the beginning of the jth interval being K;. This total is 
composed of individuals still alive at aj; and still under observation (i.e. neither 
censored or failed in previous intervals). For individuals / still at risk in this sense (for 
whom Rj = 1) the total deaths in the jth interval are 


Yo wy = 4 

Rj=1 
The Kaplan—Meier estimate of the survival curve is based on the survival rates estimated 
from the binomial events with K; — d; subjects surviving from n; at risk. In practice, we 
may restrict the likelihood to times at which failures or deaths occur, i.e. when A; is non- 
Zero. 

Some authors have taken the Kaplan—Meier approach as a baseline, but proposed 
non-parametric methods to smooth the original KM estimates. Thus, Leonard et al. 
(1994) suggest that the unknown density function of survival times f(t) be obtained via 
an equally weighted mixture of hazard functions with m components 


hmlt, & n) =m! X A, & n) (9.11) 
k=1 


where each A(t, &,, n) is a specific hazard function (e.g. exponential, Weibull), n denotes 
parameters of that function not varying over the mixture and ë, are components that do 
vary. The number of components may exceed the number of observations n, in which 
case some will be empty. The equally weighted mixture is analogous to kernel estimation 
and smooths f without assuming that f itself comes from a parametric family. This 
mixture has known component masses, and is easier to estimate and analyse than a 
discrete mixture model with unknown and unequal probabilities. The special case 
& = & =... = čp means that f can be represented by a parametric density. 


Example 9.6 Colorectal cancer: Kaplan-Meier method To illustrate the Kaplan- 
Meier procedure, we first consider data on survival in months in 49 colorectal cancer 
patients (McIllmurray and Turkie, 1987), as in Table 9.7. 


We restrict the analysis to survival in the treatment group subject to linolenic acid, and 
in relation to five distinct times of death, namely 6, 10, 12, 24 and 32 months. Totals at risk 
K; (for whom Ry = 1) are defined according to survival or withdrawal prior to the start of 
the jth interval. Thus, two of the original 25 patients censored at one and five months are 
not at risk for the first interval where deaths occur, namely the interval (6, 9). For these 
two patients the survival rate is 1. At 6 months, two treated patients die in relation to a 
total at risk of Kı = 23 patients, so the survival rate is 1 — 2/23 = 0.913. The next patient, 
censored at nine months, is also subject to this survival rate. The survival rate changes 
only at the next death time, namely 10 months, when K = 20 patients are at risk, and 
there are two deaths, so that the survival rate (moment estimate) is (0.9)(0.913)=0.822. 
This process is repeated at the next distinct death time of 12 months. 
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Table 9.7 Survival in patients with Dukes’ C colorectal cancer and assigned to 
linolenic acid or control treatment. Survival in months (* = censored) 


Linolenic acid (n=25) Control (n = 24) 
1* 13* 3* 18* 
5* 15* 6 18* 
6 16* 6 20 
6 20* 6 22* 
9* 24 6 24 
10 24* 8 28* 
10 27* 8 28* 
10* 32 12 28* 
12 34* 12 30 
12 36* 12* 30* 
12 36* 15* 33* 
12 44* 16* 42 
12* 


A technique useful in many applications with survival times involves reformulating 
the likelihood to reveal a Poisson kernel (Fahrmeir and Tutz, 2001; Lindsey, 1995). 
Aitkin and Clayton (1980) show how for several survival densities, estimation is possible 
via a log-linear model for a Poisson mean that parallels a log-linear model for the 
hazard function. Here, although other approaches are possible, we use the equivalent 
Poisson likelihood for the outcome indicators w;. These have means 


0; = Rijhj 


where h; is the hazard rate in the jth interval and has a prior proportional to the width of 
that interval, namely h; ~ G(c[a; — aj-1], c), where c represents strength of prior belief. 

Taking c = 0.001, and three chains to 5000 iterations (with 1000 burn in) we derive 
the survival rates S; at the five distinct times of death, as in Table 9.8, together with the 
hazard rates hj. 


Table 9.8 Colorectal cancer: survival and hazard rates at distinct death times 


Mean St. devn. 2.5% Median 97.5% 
Survival probabilities 
Sı 0.913 0.062 0.759 0.926 0.990 
S2 0.821 0.086 0.621 0.834 0.951 
S3 0.627 0.116 0.374 0.637 0.824 
S4 0.548 0.130 0.270 0.556 0.773 
S5 0.437 0.153 0.111 0.447 0.703 
Hazard 
hy 0.087 0.062 0.010 0.074 0.241 
hz 0.100 0.071 0.013 0.085 0.286 
h3 0.236 0.118 0.067 0.216 0.513 
h4 0.127 0.127 0.004 0.087 0.464 


hs 0.202 0.199 0.005 0.141 0.733 
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Example 9.7 Colon cancer survival: non-parametric smooth via equally weighted 
mixture As an example of the equal weighted mixture approach, we consider data 
on colon cancer survival in weeks following an oral treatment (Ansfield et al., 1977). Of 
52 patients times, 45 were uncensored (ranging from 6-142 weeks) and seven patients 
had censored survival times. Leonard et al. (1994) investigated Weibull and exponential 
hazard mixtures in Equation (9.11), but were unable to find a stable estimate for the 
Weibull time parameter n in 


h(t, n, Ex) = n&tr! 


where &, are location parameters which vary over components. They therefore used a set 
value of n = 2, and performed an exponential mixture analysis of the squared survival 
times (s = £”), so that 


h(s, E) = Ex 


They assume the exponential means are drawn from gamma density G(a, B), where « is 
known but $ is itself gamma with parameters k and ¢. In the colon cancer example, they 
set a=K=0.5 and ¢=52~7; these prior assumption are consistent with average 
survival of a year in the original time scale. 

Here the original Weibull mixture is retained in the untransformed time scale, and 
m = 15 components taken, with n a free parameter assigned an E(1) prior. The equiva- 
lent assumption to Leonard et al. on the G(a, B) prior for the location parameters (in 
the original time scale) involves setting a = k = 0.5 and ¢ = 1/52. 

Of interest are the smoothed survivor curve S(t) and the density f (t) itself. We obtain 
similar estimates of the former to those represented in Figures | and 2 of Leonard et al. 
(1994). The estimates of B and a/R (i.e. the parameters governing the Weibull means &,) 
will be affected by using the full Weibull hazard. The last 4000 of a three chain run of 
5000 iterations show B not precisely identified and skewed with median at around 115. 
The Weibull time parameter has a posterior mean estimated at 1.72, with 95% interval 
from 1.35 to 2.10. Figure 9.1 shows the resulting survivor curve up to 200 weeks. 

The analysis is similar in terms of parameter estimates and survivor function whether 
constrained (monotonic) or unconstrained sampling of &; is used, or whether the logs of 
€, are modelled via Normal priors — instead of modelling the €& as gamma variables 
(Model B in Program 9.7). In the latter case, the prior mean of log (ë) is —4 with 
variance 1, so the &, will have average 0.018. 

However, the predictive loss criterion in Equation (9.1) is lower for constrained 
gamma sampling than unconstrained gamma sampling, because of lower variances of 
new times z; when these times are censored. The log-Normal approach has a lower 
predictive loss than either gamma sampling option, and gives a slightly higher estimate 
of the Weibull time slope, namely 1.92 with 95% interval (1.45, 2.55). 


9.4.1 Discrete time hazards regression 


The usual methods for discrete time hazards assume an underlying continuous time 
model, but with survival times grouped into intervals, such that durations or failure 
times between a;_; and aj are recorded as a single value. Assume that the underlying 
continuous time model is of proportional hazard form 


M(t, 2) = do(t) exp (Bz) (9.12) 


376 SURVIVAL AND EVENT HISTORY MODELS 


Probability 


= Mean a '2.5% — '97.5% 


Figure 9.1 Weibull mixture analysis 


with survivor function 
S(t, x) = exp ( — Ao(t)e®*) 


where the integrated hazard is denoted by 


A(t) = i o(u)du (9.13) 


Then the conditional probability of surviving through the jth interval given that a 
subject has survived the previous j — | intervals is 


qj = exp ej scot 


= exp[—e*"{Ao(a;) — Ao(aj_1)}] 
while 
hj=1-—q 


is the corresponding hazard rate in the jth interval [aj_1, aj). 
The total survivor function until the start of the jth interval is 


S; = exp[— e Ag(a_1)] 
Defining 
y; = In[Ao(q@) — Ao(q@-1)] 


the likelihood of an event in interval [a;_1, aj) given survival until then, can be written 
(Fahrmeir and Tutz, 2001; Kalbfleisch and Prentice, 1980) as 
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j-l 
hS; = {1 — exp(—e® ¥)} J [exp *) (9.14) 
k=1 


Let w; = 1 for an event in the jth interval and w; = 0 otherwise. As to the regression 
term in Equation (9.14), we may allow z[aj] to be potentially time varying predictors. 
More generally, also let B, denote a regression effect fixed within intervals, but that may 
vary between intervals. If the predictors themselves are time specific one may introduce 
lagged as well as contemperaneous effects (Fahrmeir and Tutz, 2001). 

The typical log-likelihood contribution for an individual surviving j — 1 intervals 
until either an event or censoring is then 


jel 
wjlog[l —exp(—exp{y Bilall- So exp fy, By zlax]} 
k=l 
This likelihood reduces to Bernoulli sampling over individuals and intervals with 
probabilities of the event m;; modelled via a complementary log-log link, and with the 
censoring indicator forming the response. Thus, for a subject observed for r; intervals 
until either an event or censoring 


wy ~ Bernoulli(m;) i=1,..,n,j=1,..,ri 
log {— log (1 — m;)} =y;  B;zila;] (9.15) 


As well as fixed effect priors on y; and B,, one can specify random walk priors, also 
called correlated prior processes (Gamerman, 1991; Sinha and Dey, 1997; Fahrmeir and 
Knorr-Held, 1997). For example, a first order random walk prior is 


YVi=Yy & 


where the g; are white noise with variance On A variant of this is the local linear trend 
model 


YVi=V ð ey 
B =ð; Ey 


where both gı and £2 are white noise. Another option, again to avoid the parameterisa- 
tion involved in assuming fixed effects, is for y; to be modelled as a polynomial in j 
(Mantel and Hankey, 1978). Smoothness priors may also be used for time varying 
regression coefficients B, (Sargent, 1997). Thus, a first order random walk prior in a 
particular regression coefficient would be 


BSB; g 


where the g; have variance og. 


Example 9.8 Longitudinal study of ageing Dunlop and Manheim (1993) consider 
changes in the functional status of elderly people (over age 70) using data from the 
US Longitudinal Study of Ageing, carried out in four interview waves in 1984, 1986, 
1988 and 1990. This was a prospective study following an initial sample of around 5000, 
either through to death or the final wave. Dunlop and Manheim discuss the issues of 
modelling the probability of an initial disability in Activities of Daily Living (ADL) in 
terms of time-dependent covariates. A complete analysis would involve allowing for left 
censoring, since some people are disabled at study entry (in 1984). 
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They confine their analysis to people who were able in 1984 and consider transitions 
to disablement or loss of function on six ADLs: walking, dressing, bathing, toileting, 
feeding, and transferring (getting from chair to bed, and other types of short range 
mobility). They identify an empirical ordering for the average age at loss of function on 
these activities: the first disability is walking with an average age of 84 when disability 
commences, then bathing at age 87, transferring (age 90), dressing (age 92), toileting 
(age 93), and finally, feeding at age 100. 

The analysis here follows Dunlop and Manheim in considering transitions to toileting 
disability. In terms of the empirical ordering of Dunlop and Manheim, it would be 
expected that people who are already limited in walking, bathing, transferring and 
dressing to have a higher chance of a move to toilet disability. In a subsidiary analysis, 
feeding status is also used as a predictor. Thus, time-dependent dummy covariates 
(present/absent) at waves 1, 2 and 3 on these disabilities are used to predict transitions 
to toileting disability in the intervals (k, k 1), with k = 1, 2, 3. So yx, = 1 if a transi- 
tion to disability occurs for person iin the interval (k, k 1). 

A complementary log-log transform relates the probability that y,;=1 (given 
Vik-1 = Vik-2 = --9) to the included covariates. As well as the ADL status variables, 
age and education (a continuous variable) are used to predict loss of functional status; 
these are divided by 100 for numerical reasons. Finally, a linear term in < itself (i.e. the 
study wave) is included; this is then a minimal form of the polynomial prior mentioned 
above. 

Only persons observed through all four waves are included. For simplicity, observa- 
tions where toilet status atk ork 1 is unknown are excluded, but cases with missing 
status on walking, bathing, transferring and dressing at waves 1 to 3 are included. This 
involves Bernoulli sampling according disability rates 5, specific to period k and 
activity J. 

A three chain run to 2000 iterations is taken with initial value files including imputed 
values on the incomplete data (convergence is apparent by about iteration 300, and 
posterior summaries based on iterations 300-2000). We find, as do Dunlop and Man- 
heim, that transition to toileting disability is positively related to age, and to preceding 
loss of status on walking, bathing and transferring (Table 9.9). A negative effect of 
education is obtained, as also reported by Dunlop and Manheim. However, in contrast 
to the results of Dunlop and Manheim, dressing status is not found to impact on this 
transition. Also, while Dunlop and Manheim obtain a non-significant impact of wave k 
itself, we obtain a clear gradient of increased chances of transition to disability with 
larger k. 

If the analysis is extended to include preceding feeding status, essentially the same 
results are obtained (though convergence is delayed till around 1500 iterations). Thus, 
existing loss of walking, bathing or transferring status are positive predictors of loss of 
toileting status in the next interval. However, loss of feeding or dressing status, which 
tend to occur among only the very old, are not clear preceding predictors of loss of 
toileting status. 

Program 9.8 also contains the code needed to make the coefficients on age, education 
and the ability variables time specific, as in Equation (9.15). Initial runs with this 
option suggested a slight lowering of the pseudo-marginal likelihood, and so 
results are not reported in full. There was a suggestion under this model that the 
education effect became less marked for waves 2 and 3 and enhancement of the linear 
time coefficient also occurred. Another modelling possibility is to use lagged ability 
variables. 
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Table 9.9 LSOA, onset of toileting disability 


(a) Excluding Feeding Mean St. devn. 2.5% Median '97.5% 


Intercept —8.3 0.8 —9.9 —8.3 —6.68 

Time (wave) 0.244 0.074 0.108 0.238 0.397 
Age* 0.060 0.010 0.038 0.060 0.080 
Education* —0.040 0.014 —0.067 —0.040 —0.011 
Walking 0.937 0.135 0.674 0.936 1.199 
Bathing 0.567 0.154 0.264 0.567 0.871 
Transferring 0.519 0.143 0.233 0.517 0.803 
Dressing —0.177 0.191 —0.553 —0.178 0.196 


(b) Including feeding 


Intercept —8.8 0.8 —10.2 —8.9 —7.4 

Time (wave) 0.251 0.073 0.104 0.251 0.394 
Age* 0.066 0.009 0.047 0.066 0.082 
Education* —0.038 0.014 —0.066 —0.038 —0.010 
Walking 0.936 0.138 0.662 0.939 1.211 
Bathing 0.564 0.157 0.249 0.566 0.857 
Transferring 0.534 0.148 0.251 0.533 0.826 
Dressing —0.156 0.195 —0.547 —0.152 0.217 
Feeding —0.228 0.269 —0.781 —0.217 0.272 


*Effects on original age and education scales 


Example 9.9 Transitions in youth employment As an illustration of time varying 
predictor effects, this example follows Powers and Xie (2000) in an analysis of a random 
sample of young white males from the US National Longitudinal Survey on Youth 
relating to the transition from employment (at survey time) to inactivity (in the next 
survey). The sample contains 1077 subjects, and the analysis here relates to five transi- 
tions between surveys (namely changes between 1979 and 1980, between 1980- 
1981,..,up to 1983-1984). 


Age effects on the outcome are defined by age bands 14-15, 16-17 . . up to 22-23, and 
there are three time varying covariates: whether the subject graduated in the previous 
year; the local labour market unemployment rate; and whether the respondent had left 
home at survey time. Fixed covariates relate to father’s education (none, high school or 
college), family structure not intact, family income in 1979, an aptitude score (Armed 
Services Vocational Aptitude Battery Test, ASVAB) and living in the Southern USA or 
not. There are fourteen coefficients for AIC and SBC calculations; here we the SBC 
adjusted for the parameter priors is considered (Klugman, 1992) — see Models Al and 
B1 in Program 9.9. 

A model with fixed (i.e. time stationary) effects on all coefficients shows peak rates of 
the outcome at ages 18-19 and 20-21, and a positive relation of the transition to 
inactivity with coming from a broken home (Table 9.10). (Parameter summaries are 
based on a three chain model to 2500 iterations, with early convergence at around 500 
iterations). Though not significant the effects of high local unemployment and recent 
graduation are biased towards positive effects. This event is negatively related to 
aptitude, living in the South and having left home; the income effect is also predomin- 
antly negative, but the 95% credible interval just straddles zero. 


380 SURVIVAL AND EVENT HISTORY MODELS 


Table 9.10 Employment transitions 


Mean St. devn. 2.5% Median 97.5% 
Unemployment effect varying 
SBC —1097 
Age Effects 
Age 14-15 —2.92 0.33 —3.58 —2.91 —2.30 
Age 16-17 —2.85 0.25 —3.35 —2.85 —2.36 
Age 18-19 —2.57 0.26 —3.08 —2.57 —2.05 
Age 20-21 —2.57 0.29 —3.11 —2.57 —1.99 
Age 22-23 —2.67 0.44 —3.57 —2.66 —1.83 
Covariates 
FHS 0.226 0.156 —0.077 0.224 0.532 
FCOL 0.072 0.178 —0.276 0.075 0.414 
GRAD 0.190 0.176 —0.164 0.192 0.527 
INCOME —0.234 0.129 —0.492 —0.233 0.004 
ASVAB —0.440 0.074 —0.585 —0.440 —0.295 
NONINT 0.379 0.153 0.080 0.380 0.677 
SOUTH —0.517 0.185 —0.891 —0.513 —0.163 
SPLIT —0.437 0.187 —0.817 —0.432 —0.080 
Unemployment (by year) 
1979-80 0.257 0.210 —0.128 0.250 0.677 
1980-81 0.243 0.200 —0.132 0.239 0.639 
1981-82 0.236 0.192 —0.129 0.233 0.614 
1982-83 0.225 0.189 —0.135 0.224 0.590 
1983-84 0.210 0.195 —0.154 0.208 0.581 
Stationary coefficient model 
SBC —1056.8 
Age effects 
Age 14-15 —2.857 0.320 —3.538 —2.851 —2.232 
Age 16-17 —2.787 0.237 —3.242 —2.784 —2.336 
Age 18-19 —2.514 0.248 —3.005 —2.517 —2.033 
Age 20-21 —2.517 0.277 —3.074 —2.499 —1.974 
Age 22-23 —2.618 0.431 —3.487 —2.590 —1.846 
Covariates 
FHS 0.226 0.149 —0.063 0.223 0.519 
FCOL 0.063 0.181 —0.286 0.062 0.407 
GRAD 0.177 0.178 —0.178 0.177 0.532 
INCOME —0.240 0.125 —0.499 —0.242 0.004 
ASVAB —0.439 0.074 —0.585 —0.438 —0.293 
NONINT 0.382 0.150 0.089 0.387 0.677 
UNEMP 0.189 0.167 —0.128 0.190 0.542 
SOUTH —0.509 0.180 —0.881 —0.504 —0.170 
SPLIT —0.444 0.180 —0.815 —0.439 —0.106 
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Allowing the unemployment coefficient to vary over time according to a random 
walk prior suggest a lessening effect of local labour market conditions, though no 
coefficient is significant in any year. This more heavily parameterised model leads to a 
clear worsening in the Schwarz criterion and also has a slightly worse predictive loss 
criterion (9.1) for values w = 1, w = 100 and w = 10000. 

The previous analysis used a diffuse G(1, 0.001) prior on the precision 1/ Op: Sargent 
(1997) adopts an informative prior in the study he considers, namely a high precision 
consistent with small changes in 8;. For instance, one might say that shifts greater than 
+0.1 in the unemployment coefficient in adjacent periods are unlikely. If this is taken as 
one standard deviation (i.e. og = 0.1), then the precision would have mean 100, and a 
prior such as G(1, 0.01) might be appropriate as one option (see exercises). 


9.4.2 Gamma process priors 
In the proportional hazards model 
h(ti, xi) = ho(ti) exp (Bx) 


a non-parametric approach to specifying the hazard h or cumulative hazard H is often 
preferable. Priors on the cumulative hazard which avoid specifying the time dependence 
parametrically have been proposed for counting process models, as considered below 
(Kalbfleisch, 1978). However, a prior may also be specified on the baseline hazard ho 
itself (e.g. Sinha and Dey, 1997). 

Thus, consider a discrete partition of the time variable, based on the profile of 
observed times {t,,...ty} whether censored or not, but also possible referring to 
wider subject matter considerations. Thus, M intervals (ao, a], (a1, a2], .. . (am-1, am] 
are defined by breakpoints at aọ < a; < ... < ay, where ay exceeds the largest ob- 
served time, censored or uncensored, and ag > 0. Let 


8; = hola) — holaj-1) j=1,..M 


denote the increment in the hazard for the jth interval. Under the approach taken by 
Chen et al. (2000), and earlier by workers such as Dykstra and Laud (1981), the 8; are 
taken to be gamma variables with scale y and shape 


a(a;) = a(aj—ı) 


where a is monotonic transform (e.g. square root, logarithm). Note that this prior 
strictly implies an increasing hazard, but Chen et al. cite evidence that this does not 
distort analysis in applications where a decreasing or flat hazard is more reasonable for 
the data at hand. 

In practice, the intervals a; — aj_; might be taken as equal length and a as the identity 
function. If the common interval length were L, then the prior on the 8; would be set at 
G(L, y). Larger values of y reflect more informative beliefs about the increments in the 
hazard (as might be appropriate in human mortality applications, for example). 

The likelihood assumes a piecewise exponential form and so uses information only on 
the intervals in which a completed or censored duration occurred. Let the grouped times 
si be based on the observed times t; after grouping into the M intervals. The cumulative 
distribution function (cdf) is 


F(s)=1- exp —e”! | incor} 
0 
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where B; is a function of covariates x;. Assuming họ(0) = 0, the cdf for subject 7 is 
approximated as 


M 
F(s; =1- oo =e” X òsi — aj-1) | (9.16) 


j=l 


where (u) = u if u > 0, and is zero otherwise. 

For a subject exiting or finally censored in the jth interval, the event is taken to occur 
just after a;—1, so that (s; — aj-1) = (s; — aj-1). The likelihood for a completed duration 
si in the jth interval, i.e. aj-1 < si < qj, is then 


Py = F(x), aj) — F(X, aj-1) 


where the evaluation of F refers to individual specific covariates as in Equation (9.16), 
as well as the overall hazard profile. A censored subject with final known follow up time 
in interval 7 has likelihood 


Si = 1 = F(x, aj) 


Example 9.10 Leukaemia remission To illustrate the application of this form of prior, 
consider the leukaemia remission data of Gehan (1965), with N = 42 subjects and 
observed ¢; ranging from 1-35 weeks, and define M = 18 intervals which define the 
regrouped times s;. The first interval (ao, a;] includes the times ¢t = 1, 2; the second 
including the times 3,4..up to the 18th (a17, aıg] including the times 35,36. The mid 
intervals are taken as 1.5 (the average of 1 and 2), and then 3.5, 5.5,..and so on up to 
35.5. These points define the differences a — ay, a2 — aj, .. as all equal to 2. The Gehan 
study concerned a treatment (6-mercaptopurine) designed to extend remission times; the 
covariate is coded 1 (for placebo) and 0 for treatment, so that end of remission should 
be positively related to being in the placebo group. 


It may be of interest to assess whether a specific time dependence (e.g. exponential or 
Weibull across the range of all times, as opposed to piecewise versions) is appropriate if 
this involves fewer parameters; a non-parametric analysis is then a preliminary to 
choosing a parametric hazard. One way to gauge this is by a plot of — log {S(u)} against 
u, involving plots of posterior means against u, but also possibly upper and lower limits 
of — log (S) to reflect varying uncertainty about the function at various times. A linear 
plot would then support a single parameter exponential. 

In WINBUGS it is necessary to invoke the ‘ones trick’ (with Bernoulli density for the 
likelihoods) or the ‘zeroes trick’ (with Poisson density for minus the log-likelihoods). 
With B; = Bọ Bı Placebo, a diffuse prior is adopted on the intercept, and an N(0, 1) 
prior on the log of hazard ratio (untreated vs. treated). Following Chen et al. (2000, 
Chapter 10), a G(aj—aj-1,0.1) prior is adopted for ,j=1,..,M —1 and a 
G(a; — aj-1, 10) prior for òm. 

With a three chain run, convergence (in terms of scaled reduction factors 
between 0.95 and 1.05 on Bo, Bı and the 3,) is obtained at around 1300 iterations. 
Covariate effects in Table 9.11 are based on iterations 1500-5000, and we find a positive 
effect of placebo on end of remission as expected, with the remission rate about 
3.3 times (exp(1.2)) higher. The plot of —logS as in Figure 9.2 is basically 
supportive of single rate exponentials for both placebo and treatment groups, though 
there is a slight deceleration in the hazard at medium durations for the placebo 


group. 
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Table 9.11 Leukaemia regression parameters 


Mean St. devn. 2.5% Median 97.5% 
Intercept —6.45 0.35 —7.19 —6.44 —5.80 
Treatment 1.22 0.36 0.53 1.24 1.95 
Minus log survivor rates Actual Survivorship rates 
Untreated Treated Posterior means 
Mean St. devn. Mean St. devn. Untreated Treated 

—logS; 0.06 0.04 0.02 0.01 Si 0.947 0.986 
—logS» 0.23 0.09 0.07 0.03 S2 0.796 0.941 
— log S; 0.39 0.12 0.12 0.05 S3 0.676 0.902 
— log S4 0.65 0.17 0.20 0.07 S4 0.525 0.844 
— log S5 0.94 0.23 0.28 0.09 S5 0.394 0.783 
— log S6 1.08 0.26 0.33 0.10 S6 0.345 0.756 
— log S; 1.40 0.33 0.42 0.13 S7 0.255 0.697 
— log Sg 1.59 0.38 0.48 0.14 Ss 0.215 0.666 
— log So 1.85 0.44 0.55 0.16 So 0.169 0.624 
— log Sio 2.06 0.49 0.62 0.18 Sio 0.141 0.594 
— log Sıı 2.20 0.53 0.66 0.19 Su 0.124 0.573 
—log Si. 2.54 0.63 0.76 0.22 Sp 0.093 0.53 
— log Sı3 2.91 0.74 0.87 0.25 SB 0.069 0.485 
—logSi4 3.11 0.81 0.93 0.27 Si4 0.059 0.462 
—log S15 3.32 0.88 0.99 0.29 Sis 0.05 0.441 
— log Si6 3.52 0.95 1.05 0.30 Si6 0.043 0.421 
— log Sı7 3.72 1.01 1.11 0.32 Si7 0.037 0.402 
— log Sig 3.72 1.01 1.11 0.32 Sig 0.037 0.402 
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Figure 9.2 Plot of -logS(u) versus u 
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9.5 ACCOUNTING FOR FRAILTY IN EVENT HISTORY AND SURVIVAL 
MODELS 


Whether the event history or survival analysis is in discrete or continuous time, unob- 
served differences between subjects may be confounded with the estimated survival 
curve and the estimated impacts of observed covariates. While there is considerable 
debate regarding sensitivity of inferences to the specification of unobserved heterogen- 
eity, this is an important aspect to consider, especially in complex models with time- 
varying effects of predictors or clustering of subjects. 

Thus, frailty differences, whether modelled by parametric random effects or by non- 
parametric methods, provide a way to account for within-cluster correlations in event 
history outcomes (Guo and Rodriguez, 1992), or for multivariate survival times where 
an underlying common influence is present (Keiding et al., 1997). Suppose subjects 
(patients, children) are arranged within aggregate units or clusters (hospitals, families) 
and event times are affected by cluster characteristics, known and unknown, as well as 
by the characteristics of individuals. Thus, for survival after surgery, patients are 
clustered within hospitals, while for age at pre-marital maternity, adolescent females 
are clustered according to family of origin (Powers and Xie, 2000). In these examples, 
random effects at cluster level are intended to account for unmeasured differences 
between clusters that may affect the outcome at the subject level. 


Example 9.11 Unmeasured heterogeneity and proportional hazards in discrete time 
regression McCall (1994) discusses methods for assessing the proportional hazards 
assumption in discrete time regression in a single level example (with no clustering), in 
the context of data on joblessness durations in months. He considers tests for time 
varying coefficients B, in Equation (9.15), which are equivalent to testing the propor- 
tional hazards assumption in the underlying continuous time model (e.g. Kay, 1977; 
Cox, 1972). In particular, he considers the need to allow for unmeasured heterogeneity 
when applying such tests. 


McCall uses simulated data based on the real joblessness example, with a sample of 
n = 500 persons observed for a maximum of 60 intervals. An underlying continuous 
hazard is assumed with \o(t) = 0.07, and an individual level gamma frailty 6; modifying 
the corresponding discrete time event probabilities, and distributed with mean and 
variance of 1, namely 0; ~ G(1, 1). Thus, for subject i at interval j, the most general 
model involves time varying predictor x and z and time varying regression coefficients 
Pı and By: 


hy =[1—exp(—9;expty, Byxy By773)] 


and 


j-1 
Sij = exp (= 5 exp (Yp BikXik paw) 


k=1 


There is a concurrent 0.01 probability of being right-censored in any interval (e.g. via 
emigration or death in the joblessness example). 

Here we consider only one of the scenarios of McCall, involving fixed predictors x 
and z, with fixed coefficients B; and B, and no trend in the y,;. For n= 100, the 
covariates are generated via 
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x=0.02 e 
z=O0.1x  e,/2 


where e; and e are standard Normal variables. This leads to a Bernoulli likelihood 
model as above generated by a complementary log-log link 


wy ~ Bernoulli(m;) i= 1, .. 100; 7=1,.. r; 
log{—log(l—yj)}=y Bix, Bəz; In (6;) 


where B, = —0.5, Ba = 0.5, y = log (0.07) = —2.66. The r; are determined by the min- 
imum duration at which an event is generated (at which point w; = 1 and no further 
observations are made), by censoring at 60 intervals, or by loss of follow up due to the 
concurrent exit rate of 0.01. The latter two categories are coded Fail=0, those undergo- 
ing an event as Fail=1 in Program 9.11. The data generation stage produces durations 
of stay, status at final observation (event or censored), and values for the two covariates. 
Values of 0; are generated, but would be unobserved in a real application. 

In the re-estimation stage, McCall allows (despite the generation mechanism) for 
Weibull baseline hazards with Ag; = j° and 


y=y(P-(j- D’) 


where p = 1 is the exponential case. Also, although the generation procedure assumes 
fixed regression coefficients, we may test for time varying coefficients using a linear 
dependence test, namely 


By*ž =B j (9.17a) 
By* =B zj (9.17b) 


with 6; = 82 = 0 if there is no time dependence. A quadratic form of this test may also 
be used. 

The data are now analysed as if in ignorance of the generating meachanism. In 
applying a Bayesian analysis, certain issues regarding prior specification, especially 
around the 6 parameters, may be noted. Since these parameters are multiplying large 
numbers j within the log-log link, there may be initial numerical problems (i.e. in the 
first 30 or 40 iterations) unless an informative prior is specified. Other stratagems such 
as rescaling time (e.g. by dividing by 100) may be adopted to reduce numerical prob- 
lems, especially in early samples. Note that a logit rather than complementary log-log 
link is more robust to numerical problems (Thompson, 1977). 

Rather than a multiplicative frailty, an additive frailty in à; = log (0;) is taken in the 
re-estimation stage. Some degree of information in the prior for the variance of N; assists 
in identifiability. Thus, one might say, a priori, that the contrast between the 97.5th and 
2.5th percentile of 6; might be 1000 fold at one extreme and 10 fold at the other; 
corresponding percentiles of 6; would be 30 = exp (3.4) as against 0.03 = exp ( — 3.5) 
and 3.2 = exp(1.16) as against 0.32 = exp ( — 1.14). The implied standard deviations, 
on the basis of approximate Normality of \; = log (0;), are 1.7 and 0.6, respectively, 
with precisions = 1/var(A) (respectively 3 and 0.33). A G(3, 2) prior on ¢ has 2.5th 
and 97.5th percentiles 0.33 and 3.6, and can be seen to permit a wide variation in beliefs 
about frailty. 

Model A initially assumes no heterogeneity, and adopts a logit link with scaled time. 
Three chain runs to 2500 iterations showed early convergence (under 500 iterations) and 
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Table 9.12 Unemployment duration analysis 


Mean St. devn. 2.5% Median 97.5% 

(A) No heterogeneity 

Bi —0.40 0.18 —0.75 —0.40 —0.04 
B2 0.62 0.15 0.33 0.61 0.92 
Avg. of Bi; —0.29 0.19 —0.66 —0.29 0.07 
Avg. of B3 0.26 0.13 0.02 0.26 0.52 
81( x time/100) 0.33 0.72 —1.09 0.34 1.72 
82( x time/100) —1.19 0.55 —2.28 —1.18 —0.08 
y —2.37 0.26 —2.92 —2.36 —1.90 
p 1.10 0.03 1.04 1.10 1.16 
(B) Log normal frailty 

Bi —0.66 0.29 —1.31 —0.65 —0.06 
B2 0.93 0.20 0.57 0.90 1.34 
Avg. of Bi; —0.66 0.33 —1.38 —0.63 —0.10 
Avg. of B3 0.67 0.27 0.21 0.64 1.26 
õı( x time/100) 0.20 0.81 —1.41 0.21 1.79 
82( x time/100) —0.97 0.68 —2.32 —0.97 0.40 
y —3.66 0.81 —5.60 —3.48 —2.51 
p 0.93 0.08 0.76 0.95 1.06 
b 0.73 0.51 0.17 0.59 2.15 
(C) Discrete mixture 

yı —3.61 0.88 —5.37 —3.70 —2.22 
Yo —2.22 0.34 —2.87 —2.22 —1.56 
Bi —0.50 0.22 —0.91 —0.50 —0.07 
B2 0.72 0.17 0.41 0.72 1.07 
Avg. of Bi; —0.45 0.26 —0.99 —0.44 0.03 
Avg. of B3; 0.41 0.18 0.08 0.41 0.76 
õı( x time/100) 0.21 0.80 —1.41 0.22 1.71 
82( x time/100) —0.98 0.63 —2.17 —0.98 0.26 
1 0.36 0.16 0.11 0.33 0.73 
> 0.64 0.16 0.27 0.67 0.89 
p 1.05 0.04 0.97 1.05 1.13 


summaries are based on iterations 500-2500 (Table 9.12). This model shows a trend in 
the coefficient of z, and might be taken (erroneously) to imply non-proportionality. It 
also shows a significant Weibull shape parameter. Accuracy in reproducing the central 
covariate effects may need to take account of the trend estimated in the coefficient: thus 
averaging BI and B3 (as in Equation (9.17)) over all intervals is a more accurate 
approach than considering the estimates of B and 8. The original values are reproduced 
in the sense that the 95% credible intervals for the average Bi and B3, contain the true 
values, but there appears to be understatement of the real effects. 

Adopting a parametric frailty (log-Normal with G(3, 2) prior on the precision of N) in 
Model B again shows the estimated 8; in Equation (9.17) straddling zero, as it should 
do. (Summaries are based on a three chain run of 2500 iterations and 500 burn-in.) 
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There is still a tendency towards a declining effect for z(62 < 0) to be identified, but the 
95% interval clearly straddles zero. In contrast to the no frailty model, the posterior 
credible interval for p includes the null value for 1. The average Bj; and 83; (denoted 
Beta.star[] in Program 9.11) in this model overstate (in absolute terms) the true effects, 
but are less aberrant than those of model A. In terms overall model assessment, the 
Pseudo-Marginal Likelihood (PML) based on CPO estimates shows model B to have a 
PML of —289.9, and so is preferred to Model A with PML of —294.3. 

Analysis C uses non-parametric frailty assuming two components with masses @, and 
@, and differing intercepts on each component. Thus, if C; is the latent group, we fit the 
model 


logit — m)} = [G]. — G- P) (B; 817/100)x; (B2 &2j/100)z; 


A constraint on the terms y; is used for identifiability (preventing label switching). This 
produces similar results to Model B on the 6 coefficients (i.e. both straddling zero) and 
p. The average Bi; and B3; in this model slightly understate the true effects. The gain in 
the PML over Model A is relatively modest, namely —292.9 vs. —294.3, but the pattern 
of results is much more plausible. 

In this connection, McCall argues that the simple linear trend tests are sensitive to 
whether unmeasured heterogeneity is allowed for, but less sensitive as to the form 
assumed for such heterogeneity (e.g. parametric or non-parametric). In the present 
analysis, a model without any form of heterogeneity might lead to incorrect inferences 
on proportionality, the size of regression coefficients and the shape of duration effects. 
This illustrates the potential inference pitfalls about duration or covariate effects if 
heterogeneity is present but not modelled. 


Example 9.12 Pre-marital maternity We consider age at pre-marital maternity data 
analysed by Powers and Xie (2000), with the time variable being defined as age at 
maternity minus 11. These data relate to n = 2290 women, arranged in m = 1935 
clusters (with the clustering arising from the fact that the survey includes at least two 
sisters from the same family). A birth outside marriage will typically be during the peak 
ages of maternity (15-34), and times for many women are right censored at their age of 
marriage. It is plausible to set an upper limit to censored times, such as 45 (age 56), and 
this considerably improves identifiability of the models described below. 


As one of several possible approaches, we adopt a parametric mixture, at cluster level 
(j=1,..m), namely a G(w, w) prior on unobserved factors @; with mean 1 and 
variance 1/w. Thus 


ti ~ Wei(v;, DI(tž, 45) i=1,...n 
vi = UFR; 
log(m;) =a Bx; 


with ¢;* denoting censored times or zero for uncensored times, and j = c; denoting the 
cluster of the ith subject. Another option would be a discrete mixture of K intercepts 
with the group to which the jth cluster belongs being determined by sampling from a 
categorical density with K categories. 

The first analysis (with which a mixture model can be compared) is a standard model 
with no frailty and a single intercept a. Starting values for three chains are based on null 
values and on the 2.5th and 97.5th points of a trial run; results are based on iterations 
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250-1000, given early covergence. The effects for binary indicators are reported as odds 
ratios, so values above | translate into ‘positive’ effects. The results are close to those 
reported by Powers and Xie, and show extra-marital maternity positively related to 
shorter education, and non-intact family of origin, and negatively related to weekly 
church attendance, Southern residence, and self-esteem (Table 9.13). 

For the model including frailty, we consider a prior on w that reflects possible 
variations in the chance of this outcome after accounting for the influence of several 
social and religious factors. For example, a G(2, 1) prior on w is consistent with 2.5th 
and 97.5th percentiles on w of 0.3 and 5.5; these values are in turn consistent with ratios 
of the 97.5th to the 2.5th percentiles of 0; of approximately 400000 and five, respect- 
ively. This would seem a reasonable range of beliefs about the chances of such a 
behavioural outcome. By contrast, taking a relatively diffuse prior on w will generate 
implausible variations in the chance of the outcome — and possibly cause identifiability 
problems. One might experiment with other relatively informative priors on w to assess 
sensitivity. Mildly informative priors on r and a are also adopted, namely r ~ G(3, 0.1) 
and a ~ N(—10, 100). 

With a G(2, 1) prior on w, and applying the upper age limit to censored values, early 
convergence (at under 500 iterations) is obtained in the mixture analysis, and summaries 
are based on iterations 500-2000. The posterior mean of w is around 15, consistent with 
a relatively narrow range in the chance of the outcome (a ratio of the 97.5th to the 2.5th 
percentiles of 6; of approximately 3). Compared to the analysis without frailty, Table 
9.13 shows an enhanced impact of self-esteem. There is also an increase in the Weibull 
parameter. In fact, the pseudo-marginal likelihood and predictive loss criteria both 
show no gain in allowing frailty. Other model variations might be significant, though. 
Thus, a monotonic hazard could be improved on for these data: either by a more 
complex non-monotonic hazard (e.g. log-logistic) or by discretising the ages and taking 
a piecewise or polynomial function to model the changing impact of age. 


9.6 COUNTING PROCESS MODELS 


An alternative framework for hazard regression and frailty modelling is provided by 
counting process models (e.g. Aalen, 1976; Fleming and Harrington, 1991; Andersen et 
al., 1993) which formulate the observations in a way enhancing the link to the broader 
class of generalized linear models. Consider a time W until the event of interest, and a 
time Z to another outcome (e.g. a competing risk) or to censoring. The observed 
duration is then T = min (W, Z), and an event indicator is defined, such that E = 1 if 
T=Wand E=0if T= Z. 
The counting process N(f) is then 


N(t)=1(T <t, E=1) (9.18) 
and the at risk function 
Y()=I(T >t) 
where /(C) is the indicator function. If a subject exits at time T, his/her at risk function 
Y(t) = 0 for times exceeding T. So the observed event history for subject i is N;(t), 
denoting the number of events which have occurred up to continuous time t. Let dN;(®) 


be the increase in N,(t) over a very small interval (t, ¢ dt), such that dN;(t) is (at most) 
1 when an event occurs, and zero otherwise. 
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Table 9.13 Models for premarital birth; parameter summaries (parameters converted to odds 
ratios for binary predictors) 


Model without frailty Mean St. devn. 2.5% Median 97.5% 


Intercept —12.4 0.5 —13.2 —12.5 —11.2 
Nonintact family (OR) 1.57 0.15 1.31 1.56 1.91 
Mothers education under 12 years (OR) 1.70 0.17 1.38 1.69 2.03 
Family Income* —0.048 0.050 —0.148  —0.049 0.051 
No of siblings 0.021 0.021 —0.018 0.021 0.06 
South (OR) 0.82 0.07 0.70 0.82 0.96 
Urban (OR) 1.03 0.09 0.87 1.02 1.21 
Fundamental Protestant upbringing (OR) 1.55 0.16 1.25 1.56 1.85 
Catholic (OR) 0.99 0.09 0.84 0.98 1.19 
Weekly Church Attender (OR) 0.93 0.07 0.79 0.93 1.08 
Traditional Sex Attitude Score 0.14 0.08 0.01 0.14 0.31 
Self Esteem Score —0.30 0.09 —0.47 —0.30 —0.13 
Weibull Shape Parameter 3.67 0.13 3.35 3.71 3.90 


Gamma mixture analysis 


Intercept —13.7 0.3 —14.2 —13.7 —12.9 
Gamma parameter (w) 14.7 3.1 9.2 14.5 21.5 
Nonintact family (OR) 1.71 0.16 1.40 1.71 2.04 
Mothers education under 12 years (OR) 1.92 0.21 1.54 1.93 2.36 
Family Income* —0.067 0.056 —0.177 —0.066 0.050 
No of siblings 0.03 0.02 —0.02 0.03 0.07 
South (OR) 0.77 0.07 0.65 0.77 0.93 
Urban (OR) 1.04 0.08 0.86 1.03 1.20 
Fundamental Protestant upbringing (OR) 1.65 0.18 1.30 1.64 1.99 
Catholic (OR) 0.93 0.09 0.77 0.93 1.11 
Weekly Church Attender (OR) 0.89 0.07 0.77 0.89 1.03 
Traditional Sex Attitude Score 0.18 0.09 —0.01 0.19 0.36 
Self Esteem Score —0.39 0.10 —0.60 —0.39 —0.21 
Weibull Shape Parameter 4.07 0.09 3.86 4.08 4.22 


*Parameter converted to original scale 


The expected increment in N(t) is given by the intensity function 
Ldt = Y(t)h(t)dt 
with A(t) the usual hazard function, namely 
h(jdt=Prt<T<t dt, E=1|T > 


In the counting process approach, the increment in the count (9.18) is modelled as a 
Poisson outcome, with mean given by the intensity function (e.g. in terms of time 
specific covariates). Under a proportional hazards assumption, the intensity is 


L(t) = YOA = YAM) exp (Bx) (9.19) 


typically with A(t) = ħo(t) exp (Bx) as in the Cox proportional hazards model. The 
intensity may equivalently be written 


L(t) = Yi(t)dAo(t) exp (Px) (9.20) 
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and so may be parameterised in terms of jumps in the integrated hazard Ap and a 
regression parameter. 
With observed data D = (N,(t), Y;(t), xi) the posterior for the parameters in (9.18) is 


P(B, Ao|D) x P(D|B, Ao)P(B)P(Ao) 


The conjugate prior for the Poisson mean is the gamma, so a natural prior for d^o has 
the form 


dho ~ G(cH(t), ©) (9.21) 


where H(t) expresses knowledge regarding the hazard rate per unit time (i.e. amounts to 
a guess at dAo(f)), and c > 0 is higher for stronger beliefs (Sinha and Dey, 1997). The 
mean hazard is cH(t)/c = H(t) and its variance is H(t)/c. Conditional on B, the 
posterior for Ag takes an independent increments form on dAg rather than Ag itself, 


dAo(t)|B, D ~ G(cH(t) XidNit), ¢ Zi Yi(t) exp (Bxi)) 


This model may be adapted to allow for unobserved covariates or other sources of 
heterogeneity (‘frailty’). This frailty effect may be at the level of observations or for 
some form of grouping variable. 

The above basis for counting processes is in terms of continuous time. In empirical 
survival analysis, the observations of duration will usually be effectively discrete, and 
made at specific intervals (e.g. observations on whether a subject in a trial has under- 
gone an event will be made every 24 hours) with no indication how the intensity changes 
within intervals. So even for notionally continuous survival data, the likelihood is a step 
function at the observed event times. 

If the observation intervals are defined so that at most one event per individual 
subject occurs in them, then we are approximating the underlying continuous model 
by a likelihood with mass points at every observed event time. Hence, the observation 
intervals will be defined by the distinct event times in the observed data. 

The prior (9.21) on the increments in the hazard might then amount to a prior for a 
piecewise function defined by the observation intervals — this approach corresponds to a 
non-parametric estimate of the hazard as in the Cox regression (Cox, 1972). But the 
hazard might also be modelled parametrically (Lindsey, 1995). 

One aspect of the counting process model is the ability to assess non-proportionality 
by defining time-dependent functions of regressors in hazard models of the form 


h(t) = M(H exp {Bx ywi} 


Thus w;( = x;g(t) might be taken as the product of one or more covariates with a 
dummy time index g(f) set to 1 up to time 7 (itself a parameter), and to zero thereafter. 
This is consistent with proportional hazards if y = 0. 

Another possibility in counting process models applied to event histories is the model- 
ling of the impact of previous events or durations in a subject’s history. Thus, the intensity 
for the next event could be made dependent on the number of previous events, in what are 
termed birth models (Lindsey, 1995, 2001, Chapters 1 and 5; Lindsey, 1999). 


Example 9.13 Leukaemia remissions We consider again the classical data from 
Gehan (1965) on completed or censored remission times for 42 leukaemia patients 
under a drug treatment and a placebo, 21 on each arm of the trial. A censored time 
means that the patient is still in remission. Here the observation interval is a week, and 
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of the 42 observed times, 12 are censored (all in the drug group). There are 17 distinct 
complete remission times, denoted t.dist[] in Program 9.13. Termination of remission is 
more common in the placebo group, and the effect of placebo (z; = 1) vs. treatment 
(zı = 0) on exits from remission is expected to be positive. 


The hazard is modelled parametrically, and for a Weibull hazard this may be achieved 
by including the log, survival times, or logs of times since the last event, in the log-linear 
model for the Poisson mean (e.g. Lindsey, 1995; Aitkin and Clayton, 1980). Thus 


L(t) = Y(dexp(Bz ~ «* log zt) 


where k* is the exponent in the Weibull distribution, k, minus 1. We might also take a 
function in time itself: 


L(t) = Y(t)exp(Bz te) 


and this corresponds to the extreme value distribution. For the Weibull, a prior for K 
confined to positive values is appropriate (e.g. a G(1, 0.001) prior), while for ¢ a prior 
allowing positive and negative values, e.g. an N(0, 1) density, may be adopted. 

Three chain runs of 5000 iterations show early convergence on the three unknowns in 
each model. We find (excluding the first 500 iterations) a Weibull parameter clearly 
above 1, though some analyses of these data conclude that exponential survival is 
appropriate (Table 9.14). The 95% credible interval for the extreme value parameter is 
similarly confined to positive values. The extreme value model has a slightly lower 
pseudo-marginal likelihood than the Weibull model (—101.8 vs. 102.8); this is based on 
logged CPO estimates aggregated over cases with Y(t) = 1 (Y[i, jJ=1 in Program 9.13). 
The exit rate from remission is clearly higher in the placebo group, with the coefficient 
on Z being entirely positive, and with average hazard ratio, for placebo vs. drug group, 
of exp (1.52) = 4.57. 


Example 9.14 Bladder cancer As an illustration of counting process models when 
there are repeated events for each subject, we consider the bladder cancer study con- 
ducted by the US Veterans Administrative Cooperative Urological Group. This involved 
116 patients randomly allocated to one of three groups: a placebo group, a group 
receiving vitamin B6, and a group undergoing installation of thiotepa into the bladder. 
On follow up visits during the trial, incipient tumours were removed, so that an event 


Table 9.14 Leukaemia treatment effect, Weibull and extreme value 


models 

Mean St. devn. 2.5% Median 97.5% 
Weibull 
Intercept —4.70 0.64 —6.06 —4.68 =3.52 
Placebo 1.52 0.41 0.74 1.51 2.37 
Shape 1.64 0.25 1.16 1.63 2.15 


Extreme value 


Intercept —4.31 0.49 —5.30 —4.30 —3.40 
Placebo 1.56 0.42 0.76 1.55 2.39 
Shape 0.090 0.030 0.029 0.091 0.147 
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history (with repeated observations on some patients) is obtained, with 292 events (or 
censorings) accumulated over the 116 patients. Times between recurrences are recorded 
in months, with many patients not experiencing recurrences (i.e. being censored). 

A beneficial effect of thiotepa would be apparent in a more negative impact B, on the 
recurrence rate than the two other treatment options. We compare (a) Weibull vs. 
piecewise hazards, and (b) a subject level Normal frailty vs. a birth effect (modelling 
the impact v of a count of previous recurrences), and follow Lindsey (2000) in using a 
criterion analogous to the AIC. 

Summaries are based on three chain runs of 2500 iterations after early convergence 
(between 500-750 iterations in all model options discussed). The first two models use a 
Weibull parametric hazard with shape parameter k. There is no apparent difference 
from the exponential null value (k = 1) when frailty is included at the patient level 
(Table 9.15). However, omitting frailty and allowing for the influence of the number of 
previous events (also a proxy for frailty) shows the k coefficient clearly below 1. There is 
a clear influence of previous events on the chance of a further one. However, neither 
model shows a clear treatment benefit. 

The average deviance for the latter model is around 182, and must be used in a 
criterion that takes account of the extra parameters involved in the random frailty 
effects. The DIC criterion adds the parameter count to the average deviance and so is 


Table 9.15 Models for bladder cancer; parameter summaries 


Mean St. devn. 2.5% Median 97.5% 


Weibull hazard and patient frailty 


a —3.22 0.31 —3.82 —3.22 —2.64 

B2 —0.008 0.316 —0.622 —0.010 0.598 
B3 —0.348 0.307 —0.972 —0.342 0.234 
K 1.028 0.101 0.824 1.027 1.223 
o? 0.890 0.352 0.375 0.833 1.713 


Weibull hazard and history effect 


a —2.93 0.21 —3.36 —2.93 —2.54 

B2 0.093 0.169 —0.254 0.094 0.419 
B3 —0.137 0.183 —0.501 —0.135 0.210 
K 0.852 0.076 0.701 0.849 1.002 
v 0.572 0.105 0.370 0.572 0.781 


Non-parametric hazard and history effect 


B2 0.054 0.166 —0.274 0.055 0.382 
B3 —0.141 0.181 —0.509 —0.134 0.218 
v 0.503 0.107 0.296 0.499 0.718 


Hazard profile (non-parametric hazard) 


DAO, 0.023 0.008 0.011 0.022 0.040 
DA02 0.057 0.014 0.033 0.056 0.088 
DAO; 0.083 0.019 0.051 0.081 0.125 
DAO, 0.075 0.018 0.044 0.074 0.121 
DAOs 0.050 0.014 0.027 0.049 0.082 


(continues) 
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Table 9.15 (continued) 


Mean St. devn. 2.5% Median 97.5% 
Hazard profile (non-parametric hazard) 
DAO. 0.088 0.023 0.049 0.085 0.139 
DAO, 0.030 0.013 0.011 0.028 0.060 
DAO; 0.033 0.014 0.012 0.031 0.066 
DAO, 0.054 0.020 0.022 0.052 0.098 
DA0io 0.020 0.012 0.005 0.018 0.048 
DAO, 0.015 0.010 0.002 0.012 0.039 
DAO. 0.061 0.023 0.026 0.058 0.118 
DA0)3 0.025 0.014 0.005 0.023 0.057 
DA0i4 0.028 0.017 0.005 0.024 0.072 
DA015 0.020 0.014 0.003 0.017 0.052 
DA016 0.022 0.016 0.003 0.018 0.061 
DA017 0.037 0.023 0.008 0.032 0.094 
DAOis 0.027 0.021 0.003 0.023 0.082 
DAO; 0.014 0.016 0.000 0.009 0.056 
DA02 0.014 0.014 0.000 0.009 0.052 
DAO», 0.015 0.014 0.000 0.011 0.050 
DA02) 0.031 0.023 0.004 0.026 0.089 
DA023 0.017 0.016 0.000 0.012 0.057 
DA0o4 0.038 0.026 0.005 0.032 0.103 
DA0d5 0.022 0.022 0.001 0.014 0.082 
DA026 0.022 0.022 0.000 0.015 0.080 
DA0.7 0.026 0.026 0.001 0.019 0.100 
DAO 0.028 0.028 0.001 0.019 0.101 
DA029 0.040 0.041 0.001 0.027 0.152 
DA039 0.065 0.065 0.002 0.049 0.258 


DAO03; 0.071 0.073 0.001 0.047 0.250 


187 for the history effect model. For the frailty model a subsidiary calculation gives an 
effective parameter count of 27.5, and deviance at the posterior mean of 146, so the DIC 
is 201. On this basis the ‘history effect’ model is preferred. 

For a non-parametric hazard analysis one may either set one of the treatment effects 
to a null value or one of the piecewise coefficients. Here the first treatment effect (a) is 
set to zero. Applying the history effect model again shows k below 1, but the average 
deviance is 176 and taking account of the 35 parameters, the DIC statistic (at 209) shows 
that there is no gain in fit from adopting this type of hazard estimation. The parametric 
hazard (combined with the history effect model) is therefore preferred. The rates dAo(?) 
for the first 18 months are precisely estimated and in overall terms suggest a decline, 
albeit irregular, in the exit rate over this period. Applying the remaining model option, 
namely non-parametric hazard with subject level frailty, and the assessment of its DIC, 
is left as an exercise. 


9.7 REVIEW 


While many of the earliest papers describing the application of MCMC methods to 
survival models (e.g. Kuo and Smith, 1992; Dellaportas and Smith,1993) are concerned 
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with parametric survival functions, non-parametric applications are also developed in 
Hjort (1990) and Lo (1993). A major benefit of the Bayesian approach is in the analysis 
of censored data, treating the missing failure times as extra parameters, with the form of 
truncation depending on the nature of censoring (e.g. left vs. right censoring). Censoring 
is generally taken as non-informative but circumstances may often suggest an informa- 
tive process. 

Parametric survival models are often useful baselines for assessing the general nature of 
duration dependence and parametric frailty models have utility in contexts such as nested 
survival data; see Example 9.12 and Guo and Rodriguez (1992). However, much work 
since has focussed on Bayesian MCMC analysis of non-parametric survival curves, 
regression effects or frailty. For example, Laud et al. (1998) develop an MCMC algorithm 
for the proportional hazards model with beta process priors. Example 9.14 illustrates a 
comparison of a parametric hazard and non-parametric hazard based on the counting 
process model, while Example 9.11 considers a non-parametric frailty. 

Recent reviews of Bayesian survival analysis include Ibrahim et al. (2001a), Kim and 
Lee (2002) and Rolin (1998). 
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EXERCISES 


1. In Example 9.2, apply multiple chains with diverse (1.e. overdispersed) starting points 
— which may be judged in relation to the estimates in Table 9.3. Additionally, assess 
via the DIC, cross-validation or AIC criteria whether the linear or quadratic model 
in temperature is preferable. 


2. In Example 9.4, consider the impact on the covariate effects on length of stay and 
goodness of fit (e.g. in terms of penalised likelihoods or DIC) of simultaneously (a) 
amalgamating health states 3 and 4 so that the health (category) factor has only two 
levels, and (b) introducing frailty by adding a Normal error in the log(mu[i]) equation. 


3. In Example 9.6, repeat the Kaplan—Meier analysis with the control group. Suggest 
how differences in the survival profile (e.g. probabilities of higher survival under 
treatment) might be assessed, e.g. at 2 and 4 years after the start of the trial. 


4. In Program 9.8, try a logit rather than complementary log-log link (see Thompson, 
1977) and assess fit using the pseudo Bayes factor or other method. 


5. InProgram 9.9 under the varying unemployment coefficient model, try a more informa- 
tive Gamma prior (or set of priors) on 1/ o% with mean 100. For instance try G(1, 0.01), 
G(10, 0.1) and G(0.1, 0.001) priors and assess sensitivity of posterior inferences. 


6. In Example 9.12, apply a discrete mixture frailty model at cluster level with two 
groups. How does this affect the regression parameters, and is there an improvement 
as against a single group model without frailty? 


7. In Example 9.14, try a Normal frailty model in combination with the non-parametric 
hazard. Also, apply a two group discrete mixture model in combination with the non- 
parametric hazard; how does this compare in terms of the DIC with the Normal frailty 
model? 
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CHAPTER 10 


Modelling and Establishing 
Causal Relations: 


Epidemiological Methods 
and Models 


10.1 CAUSAL PROCESSES AND ESTABLISHING CAUSALITY 


Epidemiology is founded in efforts to prevent illness by contributing to understanding 
the causal processes, or etiology, underlying disease. This includes establishing and 
quantifying the role of both risk factors and protective factors in the onset of ill-health. 
Risk factors include individual characteristics or behaviours, or external hazards that an 
individual is exposed to, that increase the chance that the individual, rather than 
someone selected randomly from the general population, will develop ill-health. Exter- 
nal risk factors may relate to the environment or community, and include material 
factors (e.g. income levels), psychosocial and biological risk factors in human popula- 
tions (e.g. Garssen and Goodkin, 1999). Epidemiological analysis extends to studies of 
disease in animal as well as human populations (Noordhuizen et al., 1997). 

Bayesian approaches to modelling in epidemiology, and in biostatistics more generally, 
have been the subject of a number of recent studies. Several benefits from a Bayesian 
approach, as opposed to frequentist procedures which are routinely used in many epi- 
demiological studies, may be cited (Lilford and Braunholtz, 1996; Spiegelhalter et al., 
1999). These include model choice procedures that readily adapt to non-nested models; 
availability of densities for parameters without assuming asymptotic normality, and the 
formal emphasis on incorporating relevant historical knowledge or previous studies into 
the analysis of current information (Berry and Stangl, 1996). Also advantageous are 
Bayesian significance probabilities (Leonard and Hsu, 1999) which fully reflect all uncer- 
tainty in the derivation of parameters. Of particular interest is Bayesian model choice in 
situations where standard model assumptions (e.g. linear effects of risk factors in logit 
models for health responses) need to be critically evaluated. On the other hand, Bayesian 
sampling estimation may lead to relatively poor identifiability or slow converegence of 
certain types of models, including models popular in epidemiology such a 
spline regressions (Fahrmeir and Lang, 2001) and issues such as informativeness of priors 
and possible transformation of parameters become important in improving identifiability. 

Most usually, causal processes in epidemiology involve multiple, possible interacting 
factors. Inferences about risk and cause are affected by the nature of the causal process, 
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and by the setting and design of epidemiological studies, whether clinical or community 
based, and whether randomized trial as against observational study. The main types of 
observational study are case-control (retrospective) studies, cross-sectional prevalence 
studies, and prospective or cohort studies (Woodward, 1999). Measures of risk are 
governed by study design: for example, in a case-control study the focus may be on 
the odds ratio of being exposed given case as against control status (Breslow, 1996), 
whereas in a cohort study, the focus is on risk of disease given exposure status. 

The major designs used have a wide statistical literature attached to them and statistical 
thinking has played a major role in the development of epidemiology as a science. Some 
authors, however, caution against routine application of concepts from multivariate analy- 
sis, such as using continuous independent variables to describe risk profiles, using product 
terms for evaluating interactions, or the notion of independent effects in the presence of 
confounding (Davey Smith and Phillips, 1992; Rothman, 1986, Chapter 1). Also, the goals of 
an epidemiological analysis may not coincide with a hypothesis testing approach. 

Underlying generalisations from epidemiological studies, and guiding the application 
of statistical principles in them, are concepts of causality in the link between risk factor 
and disease outcome. Thus, a causal interpretation is supported by: (a) strength in 
associations, after controlling for confounding, evidenced by high risk ratios or clear 
dose-response relationships; (b) consistent associations across studies, different possible 
outcomes, and various sub-populations; (c) temporal precedence such that exposure 
predates outcome (subject to possible latency periods); (d) plausibility of associations in 
biological terms; and by (e) evidence of a specific effect following from a single exposure 
or change in a single exposure. 


10.1.1 Specific methodological issues 


Describing the relationship between an outcome and a given risk factor may be compli- 
cated by certain types of interaction between risk factors. Confounding, considered in 
Section 10.2, relates to the entangling or mixing of disease risks, especially when the 
confounder influences the disease outcome, and is also unequally distributed across 
categories of the exposure of interest. It may often be tackled by routine multivariate 
methods for correlated or collinear independent variables. Other major options are 
stratification and techniques based on matching, such as the matched pairs odds ratio 
(Rigby and Robinson, 2000). 

Dose-response models (Section 10.3) aim to establish the chance of an adverse 
outcome occurring as a function of exposure level (Boucher et al., 1998). Rothman 
(1986, Chapter 16) argues that the leading aim of epidemiological investigation is to 
estimate the magnitude of effect (e.g. relative risks) as a function of level of exposure. 
This may indicate categorisation of a continuous exposure variable and classical calcu- 
lation of an effect according to a category of the exposure would require then 
sub-populations of sufficient size as a basis for precisely describing trend over categories. 
However, random effect models to describe trend (e.g. via state space techniques), 
especially in Bayesian implementations, may overcome such limitations (Fahrmeir and 
Knorr-Held, 2000). Establishing consistent relationships depends on the selected out- 
come and on the measurement of exposure: risk factors such as alcoholism or outcomes 
such as good health cannot be precisely operationalised and have to be proxied by a set 
of observable items, and so latent variable models come into play (Muthen, 1992). 

Meta-analysis (see Section 10.4) refers to the combination of evidence over studies 
and hence plays a role in establishing consistency of associations: it provides a weighted 
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average of the risk or treatment estimate that improves on rules-of-thumb, such as ‘most 
studies show an excess risk or treatment benefit’ (Weed, 2000). Findings of heterogen- 
eity in risk parameters across studies need not preclude consistency. Meta-analysis may 
also play a role in more precisely establishing the strength of an association or dose- 
response relationship, e.g. in providing a summary estimate of relative risk with im- 
proved precision as compared to several separate studies. 


10.2 CONFOUNDING BETWEEN DISEASE RISK FACTORS 


As noted above confounding occurs when a certain risk factor — not the focus of interest 
or with an established influence in scientific terms — is unequally distributed among 
exposed and non-exposed subjects, or between treatment and comparison groups. 
Often, the interest is in a particular risk factor X and it is necessary to adjust for 
confounder variables Z. Mundt et al. (1998) cite the assessment of lung cancer risk 
due to occupational exposure when there is in practice mixing of risk due to smoking 
and occupational exposure. If smoking is more prevalent among occupations with a 
cancer risk then the relationship between cancer and the occupational exposure would 
be over-estimated. Complications arise in adjusting for confounders if they are not 
observed (i.e. not explicitly accounted for in the model) or if they are measured with 
error. A model erroneously omitting the confounder, or not allowing for measurement 
error if it is included, leads to under-estimation of the average impact of X on Y (Small 
and Fischbeck, 1999; Chen et al., 1999). 

Whether Z is a confounder or not depends upon the nature of the causal pathways 
(Woodward, 1999). Z is not a confounder if Z causes X, and X is in turn a cause of Y 
(e.g. if Z is spouse smoking, X is exposure to tobacco smoke in a non-smoking partner, 
and Y is cancer in the non-smoking partner). If Z causes X and both X and Z were 
causal for Y, then Z is also not a confounder. 

Generally, it is assumed that the true degree of association between the exposure and 
the disease is the same regardless of the level of the confounder. If, however, the strength 
of association between an exposure and disease does vary according to the level of a 
third variable then Z is known as an effect modifier rather than a confounder. This is 
essentially the same as the concept of interaction in log-linear and other models. It is 
possible that such a third variable Z is a confounder only, an effect modifier only, or 
both an effect modifier and confounder". 


' Consider the example of Rigby and Robinson (2000) in terms of relative risk of an outcome (e.g. deaths 
from lung cancer) in relation to smoking (X) and tenure (Z), the latter specified as owner occupier (Z = 1), 
renter in subsidised housing (Z = 2), and private renter (Z = 3). Tenure would be a confounder, but not an 
effect modifier if the risk of cancer was higher among renter groups, but within each tenure category the 
relative risk of cancer for smokers as against non-smokers was constant at, say, 2. Suppose the mortality rate 
among non-smokers was 0.1 among owner occupiers, 0.15 among subsidised renters and 0.2 for private 
renters. Then the mortality rates among smokers in the ‘tenure as confounder only’ case would be 0.2 
among owners, 0.3 among subsidised renters and 0.4 among private renters. The overall relative risk depends 
then upon the distribution of tenure between smokers and non-smokers. If smoking is less common among 
owner occupiers, then ignoring housing tenure in presentation or risk estimation would lead to overstating the 
overall relative risk of mortality for smokers (e.g. estimating it at 3 or 4 rather than 2). 

Tenure would be an effect modifier in this example if the mortality rate among non-smokers was constant, at 
say 0.1, but the relative risk of cancer mortality for smokers as against non-smokers was higher among the renter 
subjects than the owner occupiers. There is then no association between tenure and mortality in the absence of 
smoking and differences in the relative risk between tenure categories reflect only effect modification. Tenure 
would be both a confounder and effect modifier when the global estimate of the smoking relative risk is 
influenced by the distribution of smoking across tenures, but the relative risk is concurrently different across 
tenures. If there is effect modification with genuine differences in relative risk (for smokers and non-smokers) 
according to the category of Z, then a global estimate of relative risk may make less substantive sense. 
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10.2.1 Stratification vs. multivariate methods 


One method to reduce the effect of a confounder Z is to stratify according to its levels 
(Zi, .. Zm}, and then combine effect measures such as odds ratios over strata, according 
to their precisions. Data from an Israeli cross-sectional prevalence study reported by 
Kahn and Sempos (1989) illustrate the basic questions (Table 10.1). Cases and non- 
cases (in terms of previous myocardial infarction) are classified by age (Z) and systolic 
blood pressure (X). 

Age is related to the outcome because the odds ratio for MI among persons over 60 as 
against younger subjects is clearly above 1. The empirical estimate is 


15 x 1767/(188 x 41) = 3.44 


with log(OR)=1.24 having a standard deviation? of (1/15 1/1767 1/188 
1 /41)°° = 0.31. Moreover, age is related to SBP since with age over 60 as the 
‘outcome’ and SBP over 140 as the ‘exposure’, the empirical odds ratio is 


124 x 1192/(616 x 79) = 3.04 


Providing there is no pronounced effect modification, it is legitimate to seek an 
overall odds ratio association controlling for the confounding effect of age. Suppose 
the cells in each age group sub-table are denoted {a, b, c,d} and the total as 
t=a b c d. 

To combine odds ratios OR; (or possibly log OR;) over tables, the Mantel-Haenszel 
(MH) estimator sums v; = ajd;/t; and 8; = bjc;/t; to give an overall odds ratio 


>or) ò: (10.1) 


i 


This is a weighted average of the stratum (i.e. age band) specific odds ratios, with weight 
for each stratum equal to 8; = b;ci/ti, since 


{aidi /(bici)}8; = aidi/ti = vi 
Table 10.1 Myocardial infarction by age and SBP 


Age Over 60 MI Cases No MI Allin SBP group 


SBP > = 140 9 115 124 
SBP < 140 6 73 79 
All in Age Band 15 188 203 


Age Under 60 


SBP > = 140 20 596 616 
SBP < 140 21 1171 1192 
All in Age Band 41 1767 1808 


? The standard error estimate is provide by the Woolf method which relies on the Normality of log(OR). 
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The weights 6; are proportional to the precision of the logarithm of the odds ratio 
under the null association hypothesis. For the data in Table 10.1 the estimator (10.1) 
for the overall odds ratio is 1.57, as compared to 0.95 and 1.87 in the two age 
groups. 

A stratified analysis may become impractical if the data are dispersed over many 
subcategories or multiple confounders, or if the impact of the risk factor is distorted by 
categorisation, and multivariate methods such as logistic regression are the only prac- 
tical approach. For example, logit regression methods are applicable to pooling odds 
ratios over a series of 2 x 2 tables from a case-control study, even if case/control status 
does not result from random sampling of a defined population (Selvin, 1998, Chapter 4). 
In the case of Bayesian estimation, credible intervals on the resulting effect estimates 
will be obtained without requiring Normality assumptions. This is especially important 
for small cell counts {a, b, c, d}, including the case when a cell count is zero such that 
the estimate ad/(bc) is undefined. 

Multivariate methods may be applied when there is intentional matching of a case 
with one or more controls on the confounders. The precision of a risk estimate (e.g. 
odds ratio or relative risk) will be increased if the control to case ratio M exceeds 1. As 
well as providing control for confounding per se, matching on risk factors with an 
established effect (e.g. age and cardiovascular outcomes, or smoking and lung cancer) 
may enhance the power of observational studies to detect impacts of risk factors of as 
yet uncertain effect (Sturmer and Brenner, 2000). Matched studies with a dichotomous 
outcome may be handled by conditional logistic regression, conditioning on the ob- 
served covariates in each matched set (each matched set of case and controls becomes a 
stratum with its own intercept). For 1:1 matching, this reduces to the standard logistic 
regression. 


Example 10.1 Alcohol consumption and oesophageal cancer with age stratification One 
possible analysis of studies such as that in Table 10.1 is provided by a logit or log-linear 
model for the predictor combinations produced by the confounder and the exposure. For 
instance, in Table 10.1 the assumption of a common odds ratio over the two age groups 
corresponds to a model with main effects in age and blood pressure only, but without an 
interaction between them. 


This example follows Breslow and Day (1980) and Zelterman (1999) in considering a 
case control study of oesophageal cancer (Y) in relation to alcohol consumption (X), 
where age (Z) is a confounding factor. Table 10.2 shows the age banded and all ages 
study data. 

The unstandardised estimate of the overall odds ratio (from the all ages sub-table) 


96 x 666/(109 x 104) = 5.64 


However, the association appears to vary by age (the corresponding estimate for the 
65-74 group being only 2.59). One might apply the Mantel-Haenszel procedure to 
obtain an aggregate effect pooling over the age specific odds ratios, though this is 
complicated by undefined odds ratios (from classical procedures) in the lowest and 
highest age groups. 

An alternative is a log-linear model based on Poisson sampling for the frequencies 
fyxz, with case-control status being denoted Y (=1 for control and 2 for case). A log 
linear model (Model A) corresponding to a common odds ratio over the six age groups 
is then specified as 
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Table 10.2 Case control data on oesophageal cancer 


Annual alcohol consumption 


Age group Over 80g Under 80 g 
25-34 Case 1 0 
Control 9 106 
35-44 Case 4 5 
Control 26 164 
45-54 Case 25 21 
Control 29 138 
55-64 Case 42 34 
Control 27 139 
65-74 Case 19 36 
Control 18 88 
75 Case 5 8 
Control 0 31 
All ages Case 96 104 


Control 109 666 


J yxz ~ Poilmyyz) 

log(myyz) =a By Yy 8z Evy Kyz Nyxz 
and the common odds ratio across sub-tables is estimated as ẹ = exp (eyy). Fairly 
diffuse N(0, 1000) priors are adopted for all the effects in this model. A two chain run 
with null starting values in one chain, and values based on a trial run in the other, shows 
convergence from 5000 iterations: the scale reduction factors for B, only settle down to 
within [0.9, 1.1] after then. 

The Bayesian estimation has the benefit of providing a full distributional profile for 

&; see Figure 10.1 with the positive skew in ¢ apparent. Tests on the coefficient (e.g. the 
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Figure 10.1 Posterior density of common odds ratio 
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probability that it exceeds 6) may be carried out by repeated sampling and accumulating 
over those iterations where the condition > 6 is met. The posterior mean of ọ (from 
iterations 5000-10 000) is 5.48 with 95% credible interval 3.68 to 7.83. 

The pseudo (log) marginal likelihood (Gelfand and Dey, 1994) for Model A, obtained 
by accumulating over log CPOs, is —90.1. The worst fit (lowest CPO) is for the 
observation of 1 case in the 25-34 age band and high alcohol use. The predictive loss 
criterion? of Sahu et al. (1997), Ibrahim et al. (2001) and Gelfand and Ghosh (1998) is 
1918, with w = 1. Classical criteria to assess whether this model is adequate include a x? 
criterion comparing f22z with m»z (see Breslow, 1996, Equation (7), namely 


X = SO faz — unz) /Varlunz) 
Z 


and this may be evaluated at the posterior mean of Model B. The relevant posterior 
means (with variances) are 0.35 (0.11), 4.1 (2.2), 24.5 (19.1), 40.3 (33.6), 23.7 (18.8) and 
3.15 (2.2). This yields x? = 6.7 (5 d.f.), and so suggests Model A is adequate. Note that 
the posterior variances of 7 are higher than maximum likelihood estimates (for 
instance, those provided by Zelterman (1999) stand at 0.22, 2.1, 7.8, 10.6, 6.3 and 1). 

A model allowing for different odds ratios between caseness and high alcohol con- 
sumption according to the confounder level involves adding a three way interactions 
Ayyz to the above model. If these are taken to be fixed effects (Model B), then odds 
ratios for the lowest and highest age groups are still effectively undefined (though are no 
longer infinity providing the priors are proper). One may assess the fit of Model B as 
compared to Model A via the above general criteria; these are in fact in conflict, with the 
pseudo-marginal likelihood increasing to —86.5, but the predictive loss criterion 
worsening (as compared to Model A), namely to 1942. This worsening is in fact only 
apparent for small w, and for w > 5, the predictive loss criterion also favours allowing 
age specific odds ratios. 

However, a random effects model (Model C) for the three way effects (and possibly 
other parameters) is also possible, and will result in some degree of pooling towards the 
mean effect, while maintaining age differentiation if the data require it (Albert, 1996). A 
random effects approach also facilitates a more complex modelling option, considered 
below, that involves choosing between a precision 1/02 for Xyyz which is effectively 
equivalent to X\yyz = 0 (the common odds ratio model) and a precision which allows 
non-zero three way effects. Fitting Model C with a gamma prior G(1, 0.001) on 1/0? 
gives a relatively small ox for the Xyyz and age band odds ratios (phi.sr[] in Program 
10.1) all between 5.45 and 5.6. This option has a better predictive loss criterion than 
Model B and broadly supports pooling over age bands. 

Finally, we apply the specific model choice strategy, with 1/0? set to 1000000 for 
effective equivalence to \yyz = 0. A binary indicator chooses between this option and 
the prior 1/0? ~ G(1, 0.001) with equal prior probability. The option 1/0? = 1000000 
is chosen overwhelmingly, and so this procedure suggest age differentiation in the odds 


> Let f; be the observed frequencies, @ the parameters in the log-linear model, and z; be ‘new’ data sampled 
from f(z|0). Suppose v; and ç; are the posterior mean and variance of z;, then one possible criterion for any 
w > Oils 


D= oq ww DIS m-s 
i=l i=l 


Typical values of w at which to compare models might be w = 1, w = 10 and w = 100000. Larger values of w 
put more stress on the match between v; and f; and so downweight precision of predictions. 
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ratios is not required (and hence that age acts as a confounder rather than an effect 
modifier). 


Example 10.2 Framingham follow up study for CHD An illustration of confounding 
influences in a cohort study is provided by data on development of coronary heart 
disease during an 18 year follow up period among 1363 respondents included in the 
Framingham study (Smith, 2000). At the start of follow up in 1948, the study partici- 
pants were aged between 30 and 62, and the development of CHD among 268 partici- 
pants is related to age, sex and Systolic Blood Pressure (SBP) at exam 1. 


The aim is to control for confounding by age Zı and sex (Z2 = 1 for males, 0 for 
females) in the relation between CHD onset and systolic blood pressure (X), which 
for these subjects ranges between 90 and 300. Specifically an odds ratio ¢ comparing 
CHD onset probability for participants with initial SBP above and below 165mm Hg is 
sought. To investigate the impact of categorising continuous predictor variables, 
whether the risk factor itself or a confounder, different types of logit regression may 
be considered. The first is linear in the continuous predictors age and SBP, the second 
converts these predictors to categorical form, and the third considers non-linear func- 
tions of age and SBP. 

The first logit regression takes both age and SBP as continuous with linear effects 
(Model A in Program 10.2). A three chain run* shows convergence at around 750 
iterations and summaries are based on iterations 1000-5000. The resulting equation 
for CHD onset probability (with mean and posterior SD of coefficients) is 


logit(m) =a BZ, BX B32) 
= —7.2 0.050Z, 0.0171X 0.927, 
(0.9) (0.017) (0.002) (0.15) 


Note that the maximum likelihood solution (from SPSS) is very similar, the only slight 
difference being that the ML estimation has a coefficient on age of 0.052 with SD of 
0.015. Mildly informative N(0, 10) priors are used for the coefficients on the continuous 
predictors to avoid numerical overflow (which occurs if large sampled values for 
coefficients are applied to high values for Age or SBP). It may be noted that the logit 
link is more robust to extreme values in the regression term than alternatives such as the 
probit or complementary log-log links. Another option is to scale the age and SBP 
variables, for example to have a range entirely within 0 to 1 (e.g. dividing them by 100 
and 300, respectively), or to apply standardisation. 

In this first analysis, the original scales of age and SBP are retained and it is necessary 
obtain the average SBP in the group with SBP above 165 (namely 188.4) and the 
remainder (namely 136.5). The relevant odds ratio, under this continuous regressors 
model, is then the exponential of the coefficient for SBP times the difference 
188.4 — 136.5 = 51.9 


b = exp (B2 x 51.9) 


As noted by Fahrmeir and Knorr-eld (2000), an advantage of MCMC sampling is that 
posterior densities of functionals of parameters (here of B,) are readily obtained by 


4 Starting values are provided by null values, the posterior average from a trial run, and the 97.5th point from 
the trial run. 
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repeated sampling. Thus, the mean and 95% credible interval for b are obtained as 2.44 
(2.09, 2.89). 

Standard fit measures (e.g. predictive loss criteria or pseudo marginal likelihood) may 
be applied. Thus the criterion in footnote 3 with w = 1 stands at 299.6 and the pseudo- 
marginal likelihood at —630.5. To further assess fit and predictive validity the risk 
probabilities m; may be arranged in quintiles, and the cumulated risk within each 
quintile compared with the actual numbers in each quintile who developed the disease. 
Thus, among the sample members with the lowest 273 risk probabilities (approximately 
the lowest 20% of the 1363 subjects) we find the number actually developing CHD, then 
apply the same procedure among those ranked 274-546, and so on. There is some 
departure in the logit model prediction from the actual risk distribution, as in Table 
10.3. Note that this is best done by monitoring the m; (using the Inference/Summary 
procedure in BUGS) and then using other programs or spreadsheets to reorder the 
cases. Program 10.2 also contains the array rank[1363, 5] that monitors the quintile risk 
category of each subject. 

We next consider a categorical regression analysis (Model B in Program 10.2), with 
the first continuous predictor SBP dichotomised at above and below 165, and the age 
predictor forming a four fold category, denoted AgeBand[] in Program 10.2: age under 
50, between 50-54, between 55-59, and over 60. As usual a corner constrained prior is 
used with y; = 0 and yj ~ N(0, 1000) for j = 2, 3, 4. A three chain run’ shows conver- 
gence at around 300 iterations and summaries are based on iterations 500-5000. Thus 
the model, with /(s) = 1 for s true, is 


logit(7;) =a BıZ2 B(X > 165)  y[AgeBand] 


Standard fit measures show a worse fit under this model. Thus, the predictive loss criterion 
stands at 303.2 and the pseudo-marginal likelihood at —641. The match of actual and 
predicted risk is assessed over the 16 possible risk probabilities, formed by the high and low 
categories of SBP and the four age groups (part b of Table 10.3). This shows an acceptable 
fit (cf. Kahn and Sempos, 1989). The posterior median of the odds ratio b between high and 
low SBP subjects controlling for age confounding is estimated at around 2.74. 

Specific types of nonlinear regression models have been proposed for representing 
risks (Greenland, 1998a). For example, a flexible set of curves is obtained using 
fractional polynomial models, involving the usual linear term, one or more conventional 
polynomial terms (squares, cubes, etc.), and one or more fractional or inverse powers 
(square root, inverse squared, etc.). A simple model of this kind in, say, SBP might be 


logit(m) =a BiZ BX PBX? Bax” 


For positive predictors X, log,(X) can be used instead of X°° to give a curve with 
a gradually declining slope as x increases (Greenland, 1995). In fact, inclusion of 
log, (X) allows for the possibility of non-exponential growth in risk; for instance, 
exp (B log, (X)) = X? can increase much slower than exponentially. Fractional polyno- 
mials and spline regression have been advocated as improving over simple categorical 
regression (Greenland, 1995); possible drawbacks are potentially greater difficulties in 
identifiability and convergence, and also the desirability of ensuring sensible dose- 
response patterns. For example, a polynomial model in SBP, while identifiable, might 
imply an implausibly declining risk at SBP above a certain point such as 275. 


> Starting values are provided by null values, the posterior average from a trial run, and the 97.5th point from 
the trial run. 
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Table 10.3 Alternative logistic regression models to assess risk according 
to SBP (Framingham study) 


(a) All Continuous Predictors treated as such 


Quintile of risk Observed Expected under Chi square 
Probability logistic 

Ist 16 22.3 1.80 
2nd 30 35.6 0.88 
3rd 64 48.5 4.95 
4th 65 64.4 0.01 
5th 93 98.7 0.33 
Total 268 269.6 7.96 


(b) Continuous predictors in category form 


Summing over 16 Possible Risk Probabilities 


Expected under Observed Risk Chi square 
logistic probability 
15.3 13 0.081 0.41 
16.9 15 0.101 0.25 
4.6 7 0.132 0.81 
18.5 18 0.134 0.01 
31.6 36 0.173 0.53 
8.2 7 0.195 0.20 
34.9 35 0.210 0.00 
14.3 16 0.234 0.19 
12.4 13 0.263 0.03 
37.9 36 0.269 0.10 
Se, 3 0.292 1.68 
20.5 25 0.297 0.82 
9.4 9 0.362 0.02 
14.7 14 0.419 0.03 
4.9 5 0.490 0.00 
17.5 16 0.499 0.14 
266.8 268 5.22 


Here the coefficient selection procedure of Kuo and Mallick (1998) is applied to the 

specification 
logit(m;) =a BıZı BZ; B3 log, (Z1) 
B4X BX? Belog.(X) B22 

Age and SBP are obtained by dividing the original values by 100 and 300, respectively. 
Binary selection indicators, with Bernoulli(0.5) priors, are applied to the coefficients 
Bı — Bg. A single run of 10000 iterations (see Model C in Program 10.2) shows B, and 
Be to have posterior selection probabilities exceeding 0.98, while the remaining coeffi- 


cients have selection probabilities below 0.10. A third logit model is therefore estimated, 
namely 


logit(m)=a BiZ? Bslog(X) ByZ (10.2) 
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This yields a slight improvement in pseudo-marginal likelihood over the linear continu- 
ous predictors model above (—628.5 vs. —630.5) and in the predictive loss criterion with 
w=1 (namely 298.4 vs. 299.6). The parameter summaries for Model (10.2), from 
iterations 500-5000 of a three chain run, are in Table 10.4. The odds ¢ ratio is very 
similar to those previously obtained. 


Example 10.3 Larynx cancer and matched case-control analysis The impact of match- 
ing to control for confounders and clarify the risk attached to an exposure of interest is 
illustrated by an example from Sturmer and Brenner (2000). They consider the utility of 
matching in case-control studies on risk factors whose effect is established and of no 
substantive interest. The interest is rather in the impact of a new suspected risk. They 
cite existing case-control findings on the link between larynx cancer and smoking (four 
categories, namely 0-7, 8-15, 16-25, over 25 cigarettes per day) and alcohol consump- 
tion (bands of 0—40, 40-80, 80-120, and over 120 grammes per day). Table 10.5 shows 
the relative distribution of cases and population between the 16 strata formed by 
crossing these two risk factors. 


The impact of smoking and alcohol is established, and the interest is in the impact of 
case-control matching to assess the effect of a new putative risk X. We compare matched 


Table 10.4 Nonlinear risk model, parameter summary 


Mean St. devn. 2.5% Median 97.5% 


Odds Ratio 2.63 0.34 2.05 2.61 3.39 
a —1.17 0.51 —2.11 —1.20 —0.13 
Bi 4.81 1.39 1.90 4.87 7.41 
B2 2.98 0.40 2.23 2.98 3.79 
B3 0.93 0.15 0.64 0.93 1.22 


Table 10.5 Larynx cancer cases and controls across established risk factor combinations 


Stratum Smoking rate Alcohol Exposure risk Proportion of Proportion of 
identifier (no. of cigarettes consumption to X (in population cases belonging population 
per day) and controls) to stratum defined belonging to 
under moderate by known risk stratum 
confounding factors 
1 0-7 0-40 0.01 0.010 0.168 
2 41-80 0.02 0.024 0.140 
3 81-120 0.03 0.017 0.053 
4 Over 120 0.04 0.027 0.031 
5 8-15 0-40 0.02 0.022 0.081 
6 41-80 0.04 0.078 0.092 
7 81-120 0.06 0.068 0.043 
8 Over 120 0.08 0.095 0.023 
9 16-25 0-40 0.03 0.066 0.081 
10 41-80 0.06 0.103 0.09 
11 81-120 0.09 0.127 0.045 
12 Over 120 0.12 0.137 0.035 
13 26 0-40 0.04 0.012 0.043 
14 41-80 0.08 0.037 0.034 
15 81-120 0.12 0.054 0.025 
16 Over 120 0.16 0.122 0.015 
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sampling, with controls sampled according to the case profile (i.e. the proportionate 
distribution among the 16 strata, as in the penultimate column in Table 10.5), with 
unmatched sampling. Under unmatched sampling, the sampling of controls is according 
to the population profile, given by the last column of Table 10.5. 

For illustration, M = 2 controls are taken for each of 200 cases, and the exposure 
disease odds ratio in each stratum (the odds ratio of exposure to X given case-control 
status) is assumed to be 2. Sturmer and Brenner then generate samples with 200 cases 
and 400 controls to establish the power to detect this effect size under various assump- 
tions about the confounding of the new risk factor X with the established risk factors Z 
and Z: (smoking and alcohol consumption). In Program 10.3 it is necessary to generate 
both the stratum (defined by Zı and Z2) from which an individual is sampled, and 
exposure status to X; for cases these are indexed by arrays Stratcase[ ] and Exp.case[ ]. 

Under the first assumption there is no confounding, with an exposure rate to the new 
factor X (proportion exposed to X in strata 1 to 16) set at 0.05 in all strata. Under an 
alternative moderate confounding assumption, the exposure rate rises in increments 
from 0.01 in the lowest smoking and alcohol intake group to 0.16 in the highest smoking 
and drinking group (see Table 10.5). 

Sturmer and Brenner report higher powers to establish the assumed odds ratio of 2 
under matched than unmatched sampling, and higher powers also under moderate 
confounding than no confounding. The analysis here confirms the ability of matched 
case-control sampling to obtain the correct odds ratio regardless of the confounding 
scenario, and the greater power to detect a positive odds ratio under moderate con- 
founding rather than no confounding. 

Under matched sampling both cases and controls are drawn to have the same 
distribution across the 16 strata, namely that in the penultimate column of Table 
10.5. It is necessary to assess the power of the study to detect a positive relation between 
exposure and disease. The test used to establish the significance of the log of odds ratio 
(and hence power of the study) for each sample of 600 involves the empirical variance of 
the log of the odds ratio over all strata combined. It is preferable to use the log of the 
odds ratio to assess power as this is more likely to be approximately Normal, whereas 
the odds ratio itself is usually skewed. 

Thus, let A, B, C and D be exposed cases, unexposed cases, exposed controls and 
unexposed controls respectively accumulated over all strata, with the stratum equiva- 
lents being aj, bj, c; and dj. So the variance of k = log, (OR) is 


1/A 1/B 1/C 1/D 


where A = X;a;, B = X;bj, C = Njc; and D = X;d;. A refinement is to form the Mantel- 
Haenszel estimate ORmy of the overall odds ratio, with weighting of the stratum odds 
ratios according to their precision. 

A run of 5000 iterations with moderate confounding and matched case-control 
sampling with M = 2 leads to a power of 70.6% to detect a positive odds ratio at 
2.5% significance (compared to 71.1% obtained by Sturmer and Brenner) and an 
estimated mean OR of 2.02. Estimates using the Mantel-Haenszel procedure are very 
similar, but involve slower sampling. When there is no confounding across the strata 
formed by Zı and Z2, but still matched case-control sampling, the power is reduced to 
around 54% and the mean odds ratio is 2.12 (and median 2.01). 

Under unmatched sampling with any degree of confounding the crude odds ratio is 
an overestimate. To allow for the fact that, under this type of sampling, controls are 
sampled disproportionately from strata with low exposure risk, one may adjust the 
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crude odds ratio to take account of differential exposure to risk. One may obtain the 
ratio of average exposure to risk among cases as compared to average exposure among 
controls on the basis of a standard risk profile p; (exposed to risk of X) over the strata. 
Thus, Table 10.5 shows the population (and control) risk profile under moderate 
confounding, and it can be seen that sampling from the case distribution Teaser | (penulti- 
mate column) leads to higher average exposure than sampling from the population 
distribution Tpoprj (last column). A run of 10000 iterations estimates the median of the 
ratio Rexp = UpjTeasefj|/UP;Mpop[j] at 1.93 on the basis of the actual sampling propor- 
tions over strata at each iteration. This is used to calculate adjusted totals C’ = C.Rexp 
of exposed controls, and D’ = 400 — C’ of unexposed controls. The median crude OR is 
3.98, and the median of the adjusted OR is then 1.98. The log of the adjusted OR 
is found to have a variance of 0.34? from a trial run, and from this a power of 52% 
(Sturmer and Brenner obtain 51%) to detect an association between disease and expos- 
ure is obtained, compared to 70.6% under matched sampling. 

A wide range of alternative scenarios may be investigated; for example Sturmer and 
Brenner (2000) consider a strong confounding scenario with the exposure risk ranging 
from 0.005 in stratum 1 to 0.32 in stratum 16. Alternative numbers M of matched 
controls may also be taken (e.g. up to M = 5). 


Example 10.4 Obesity and mental health Increasingly, health strategy and measures 
of health and clinical gain focus on improving quality of life, as well as extending life 
expectancy. These measures in turn depend upon valuations of health status, founded in 
utility theory, with different health states being assigned differing utilities ranging from 
0 (death) to 1.0 (perfect health), or possibly scaled to run from 0 to 100. Following Doll, 
Petersen and Brown (2000), we analyse responses on an instrument used to assess health 
status and quality of life in both clinical and community settings, namely the Short 
Form 36 or SF36 questionnaire (Jenkinson et al., 1993). This questionnaire has eight 
subscales, and includes items on particular aspects of physical and mental health and 
function. Here, observed subscale totals on the SF36 are used to measure the broader 
latent dimensions of physical and mental health. We then examine the associations 
between scores on these dimensions, actual reported illness status, and obesity, also 
observed directly. 


Doll et al. report on studies finding an adverse impact of obesity on mental health, in 
addition to the established (and clinically plausible) impact of obesity on physical 
health. Other studies, however, have not found an association between emotional 
disturbance and obesity. Doll et al. suggest that some existing studies may not be 
controlling for confounding of the link between obesity (X) and mental health (F) by 
illness status (Z). Thus, obese people are more likely to have chronic illness, and once 
this is allowed for there may be no impact of obesity per se on emotional health. 
Specifically, Doll et al. combine obesity (X; = 1 for yes,= 0 for no) and chronic illness 
(Z; = 1 for yes, = 0 for no) into a composite indicator J;. They find no difference in 
mental health between those with no illness and no obesity (X; = Z; = 0) and those 
obese only without being ill (X; = 1, Z; = 0). 

The work of Doll et al. illustrates that a set of items may contain information on more 
than one latent dimension. Thus they use the eight items from the Short Form 36 Health 
Status Questionnaire to derive mental and physical health factor scores, though they 
assume these factors are uncorrelated (orthogonal) in line with the SF36 developers’ 
recommendations (Ware et al., 1994). In this connection, we consider six of the eight 
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items of the SF36 recorded for 582 women aged 65-69 in the 1996 Health Survey for 
England. The selected items have values from 0 to 100, with the low score corresponding 
to most ill on all items and the high score to most well. Two items were excluded, 
because their distribution was highly spiked (concentrated on a few values) despite being 
nominally continuous variables. The density of the other scores is also skewed, with a 
bunching of values on all the items at 100 (the ‘ceiling effect’ in health status measure- 
ment). One might consider truncated sampling combined with a density allowing for 
skewness, and below a log-normal model is adopted — which reflects the minimum of the 
items being non-negative. Other options for sampling might be envisaged, such as a beta 
density or even a binomial, if we round non-integer values between 0 and 100. 

In fact, the binomial provides a simple way of dealing with missing values in the 
health status outcomes, and is here used exclusively for that purpose — it provides an 
integer ‘success’ total Vj; between 0 and 100 in relation to a number at risk Nj of 100 (for 
i= 1,..582 and j = 1, 6). It is necessary to impute missing values for the six SF36 items 
to be able to use the log-normal model (applied to the observed and imputed data 
combined as if it were all observed). The low rate of item missingness in these data is 
thus modelled according to 


Vij ~ Bin(ay, Nj) 
logit(m;) = Yy Yy X Vi 


where V; is the total score on all six items, and is a (relatively crude) measure of overall 
health status. For illustration, a single imputation is used to ‘fill out’ the health status 
outcomes, though a full multiple imputation would use several imputations of the 
missing data, possibly generated under different non-response mechanisms. 

We then relate the logged scores vı—ve on the six observed items, Vı to Ve, (SF36 
Physical Health, Pain, General Health, Vitality, Social Function, SF36 Mental Health) 
to the 2 hypothesised latent constructs, also denoted physical and mental health, with 
symbols F; and Fy. (Note that pain scores are higher for lower reported levels of pain.) 
Thus items 1-3 are assumed to be linked to the physical health factor, and items 4-6 to 
the mental health factor. For subject i 


Vii =ò, Buk en 

Vai =ò PBF ez 

V3; =ð PBizFii ezi 

Vai = ò4 Bozi Cai 

Vsi = òs Bosko; esi 

Vei = òs PsF2i esi 
where the e; are independent Normal errors with zero means (with G(1, 1) priors on 
their precisions 1;). For identifiability the constraint B,;, = B24 = 1 is adopted (see 
Chapter 8). The factors are uncorrelated, and allowed to have free variances and 
means which differ by obesity status X, by illness type Z or by illness-obesity combined 
in analyses denoted (a), (b) and (c), respectively. Body mass X has categories below 20, 


20-25, 25-30 and 30 , and illness Z has three categories (ill, slightly ill, well). 
Thus in Model (a), 


Fri Gy NOx ;) 
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with means vj, varying over obesity category j and the k = 1, 2 factors. The precisions 
1/d, are taken to be G(1, 0.001). Since relativities between categories are the main 
interest, it may be assumed that vı = 0, with centred parameters then obtained as 
Vig = Vik — Vx. In Model (b) the means are defined over illness and factor: 


Fri ~ Nz x, Oy) 


and in Model (c) over eight joint obesity and illness categories, with well and slightly ill 
combined. 

Convergence on all three models is apparent after 1000 iterations in Models (a) and 
(b) (5000 iterations in Model (c)) in a two chain run of 5000 iterations (10000 in Model 
(c)), and applying the over-relaxation option. Starting values in one chain are null 
values, and for the other are based on trial preliminary runs. Fit is assessed via the 
predictive loss criterion of Ibrahim et al. (2001) and the pseudo-marginal likelihood of 
Gelfand (1995). 

In Model (a) it appears that the obesity group means on the two factors show the worst 
health for the low BMI group; their physical health score of —0.32 is clearly worse than 
other levels of BMI and their emotional health is significantly negative (Table 10.6). It 
may be that low BMI is a proxy for certain types of emotional disturbance. The CPOs 
suggest potential outliers; for instance subject 447 has a low CPO on item 6, where the 
score is 0, despite having scores of 100 on social function. This model has pseudo 
marginal likelihood of —16 980 and predictive loss criterion (with w = 1) of 7747 x 10°. 


Table 10.6 Factor means by BMI and/or illness band 


Model (a) Factor Means varying by BMI 

Physical Health Factor Mean St. devn. 2.5% 97.5% 
Mean by BMI Band 1 —0.32 0.20 —0.68 0.03 
Mean by BMI Band 2 0.30 0.07 0.15 0.44 
Mean by BMI Band 3 0.13 0.07 —0.01 0.28 
Mean by BMI Band 4 —0.11 0.10 —0.31 0.08 
Mental Health Factor 

Mean by BMI Band 1 —0.10 0.04 —0.18 —0.01 
Mean by BMI Band 2 0.08 0.02 0.04 0.12 
Mean by BMI Band 3 0.03 0.02 —0.01 0.07 
Mean by BMI Band 4 —0.01 0.03 —0.06 0.04 


Factor Loadings 


Bio 0.87 0.06 0.77 0.99 
Bis 0.49 0.04 0.43 0.57 
Bog 2.03 0.17 1.72 2.41 
Bos 1.93 0.19 1.58 2.33 


Factor Variances 


Var(F1) 0.46 0.05 0.37 0.56 
Var(F2) 0.05 0.01 0.04 0.07 


(continues) 
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Table 10.6 (continued) 


Model (b) Factor Means varying by Illness Type 


Physical Health Factor Mean St. devn. 2.5% 97.5% 
Ill —0.55 0.04 —0.63 —0.46 
Slightly Ill 0.21 0.04 0.13 0.29 
Well 0.34 0.04 0.26 0.42 
Mental Health Factor 

Ill —0.14 0.02 —0.17 —0.11 
Slightly Ill 0.05 0.01 0.02 0.08 
Well 0.09 0.01 0.06 0.11 


Factor Loadings 


Biz 0.85 0.05 0.75 0.96 
Biz 0.52 0.04 0.46 0.60 
Boa 2.12 0.16 1.81 2.47 
Bos 1.94 0.18 1.59 2.31 


Factor Variances 


Var(F1) 0.31 0.04 0.25 0.39 
Var(F2) 0.04 0.01 0.03 0.05 


Model (c) Factor Means varying by Combined Illness and BMI Type 


Physical Health Factor Mean St. devn. 2.5% 97.5% 
Ill and Low BMI —0.48 0.33 —1.28 —0.04 
Ill and Avg BMI —0.04 0.10 —0.25 0.16 
Ill and Above Avg BMI —0.20 0.08 —0.36 —0.04 
Ill and High BMI —0.72 0.15 —0.95 —0.36 
Well or Slight Ill, and Low BMI 0.35 0.18 0.01 0.70 
Well or Slight Ill, and Avg BMI 0.43 0.08 0.29 0.59 
Well or Slight Ill, & above avg BMI 0.36 0.08 0.21 0.51 
Well or Slight Ill, and High BMI 0.30 0.09 0.13 0.48 
Mental Health Factor 

Ill and Low BMI —0.16 0.09 —0.35 0.01 
Ill and Avg BMI 0.01 0.04 —0.06 0.08 
Ill and Above Avg BMI —0.03 0.03 —0.08 0.02 
Ill and High BMI —0.16 0.04 —0.23 —0.06 
Well or Slight Ill, and Low BMI 0.08 0.06 —0.05 0.20 
Well or Slight Ill, and Avg BMI 0.10 0.03 0.05 0.15 
Well or Slight Ill, & above avg BMI 0.07 0.02 0.02 0.12 
Well or Slight Ill, and High BMI 0.10 0.03 0.05 0.16 


Factor Loadings 


Bio 0.80 0.05 0.70 0.92 
Bi3 0.47 0.03 0.40 0.54 
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Bog 2.04 0.14 1.78 2.33 
Bos 1.94 0.17 1.63 2.28 


Factor Variances 


Var(F1) 0.38 0.04 0.30 0.46 
Var(F2) 0.04 0.01 0.03 0.05 


A more convincing difference in mental health means is apparent for illness categories 
— analysis (b). The ill subjects have significantly worse mental health, though slightly ill 
as against well subjects do not differ in their mental health scores. This model has a 
higher pseudo marginal likelihood but worse predictive criterion than Model (a) — an 
example of conflict in model assessment criteria. 

In a third analysis, analysis (c), the least two serious illness categories are combined 
and the resulting binary illness index crossed with the obesity categories. In terms of 
mental health a virtually flat profile in means over BMI can be seen for the less ill 
categories. Only when combined with more serious illness are both high BMI and low 
BMI associated with worse emotional health (though the interaction between low BMI 
and illness is not quite significant at the 5% level in terms of negative mental health). 
This model has a better predictive criterion than Models (a) or (b), but only improves in 
terms of pseudo-marginal likelihood over model (a). These findings replicate those of 
Doll et al. quite closely even though the analysis here is confined to one demographic 
group. Specifically, obesity does not have an independent effect on mental health and its 
impact is apparent only when combined with more serious illness. 


10.3 DOSE-RESPONSE RELATIONS 


Dose-response models typically aim to establish the probability of an adverse effect 
occurring as a function of exposure level (Boucher et al., 1998), or of health gain from 
treatment inputs. They may derive from experiments involving human or animal 
subjects, or from observational and cohort studies. Evidence of a monotonic trend in 
the risk of disease over different exposure levels of a risk factor, lends support to a 
causal relationship, and provides a basis for public health interventions. A monotonic 
downward trend in disease risk with increased levels of a putative protective factor may 
also be relevant (e.g. cancer in relation to vegetable and fruit consumption). 

The National Research Council (NRC, 1983) places dose-response assessment as one of 
series of stages in risk assessment, which includes hazard identification and hazard 
characterisation. Within the characterisation stage, risk assessment involves establishing 
a dose-response relationship and the site and mechanism of action. For example, in studies 
of developmental toxicology, hazard identification includes establishing whether new 
chemicals impair development before humans are exposed to them, and so the chemicals 
are evaluated in experimental animals to assess their effect on development. Hasselblad 
and Jarabek (1996) consider possible benefits of a Bayesian estimation approach in these 
situations, for example in obtaining the lower confidence point of the ‘benchmark dose’ 
that produces a 10% increase in the chance of a developmental abnormality. 

Quantification of exposure and of the resulting risk are central in framing and 
assessing dose-response relations. In some circumstances, in designed trials or cohort 
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studies, exposure to relevant risk factors may be intrinsically graded into a discrete 
number of levels, while in other instances an originally continuous exposure may be 
grouped into categories. Incidence rates may not be meaningful unless they are calcu- 
lated for reasonably sized sub-populations, and if exposure is measured on a continuous 
scale then this is not possible (Rothman 1986, Chapter 16). One then typically compares 
estimates of effect for each category in comparison to a reference category (such as the 
lowest dosage exposure group). These may be obtained by regression methods, or by 
stratifying over a confounder at each level of the outcome, and forming a pooled 
estimate with weights based on a common standard for the effect at each level. 

The latter method may be illustrated by case-control data from Doll and Hill (1950) 
on lung cancer in relation to daily smoking, with 60 matched female cases and controls 
and 649 male cases and controls (Table 10.7). The weights are based on the distribution 
of the two levels of the confounder (male, female) among the controls (Miettinen, 1972), 
so that male and female weights are respectively w; = 0.915 (=649/709) and w2 = 0.085. 
An empirical estimate of the rate ratio of lung cancer for 1—4 cigarettes as compared to 
zero cigarettes is obtained by comparing the weighted total of the ratios of exposed 
cases to exposed controls with the weighted total of the ratios of unexposed cases to 
unexposed controls. These are 0.915(55/33) 0.085(12/7) and 0.915(2/27) 0.085(19/32), 
respectively, so that the estimated effect (here a rate ratio) is 5.07. For 5-14 and 15 
cigarettes the corresponding estimates are 7.98 and 12.09. 

In a Bayes implementation, one would seek to allow for sampling uncertainty (e.g. 
illustrated by the small number of male cases at the lowest exposure level). Thus one 
might assume multinomial sampling conditional on the four totals (male controls, male 
cases, female controls, female cases). With a Dirichlet prior on the four sets of prob- 
abilities one obtains posterior mean rate ratio estimates for exposure levels r = 2, 3, 4 of 
4.71 (s.d. 2.2), 7.24 (3.1) and 10.9 (4.6). The Bayes procedure’ clarifies the uncertainty in 
the empirical estimates, and shows they overstate the risk relative to baseline exposure. 

A possible drawback in using a categorisation with several (R) levels of an originally 
continuous risk factor means that confidence (credible) intervals in the resulting effect 
estimates do not reflect the relationship between possible patterns in the effect estimates 
and the continuity of the underlying variable. These considerations also apply if the 


© The program and data (inits may be generated randomly) are: 

model {# weights according to distribution of confounder among controls 
M[1:2] ~ dmulti(w[1:2], TM) 

w[1:2] ~ ddirch(alpha[1:2]) 

# distribution of male cases over exposure levels (level 1 is zero exposure with no cigarettes smoked) 
a[l, 1:4] ~ dmulti(pi.case[1, 1:4], N[1]) 

# distribution of female cases over exposure levels 

a[2, 1:4] ~ dmulti(pi.case[2, 1:4], N[2]); 

# distribution of male controls over exposure levels 

b[1, 1:4] ~ dmulti(pi.control[1, 1:4], M[1]) 

# distribution of female controls over exposure levels 

b[2, 1:4] ~ dmulti(pi.control[2, 1:4], M[2]); 

for (i in 1:2) {pi.caseļi, 1:4] ~ ddirch(alpha[ ]); pi.controlļi, 1:4] ~ ddirch(alpha[ ])} 
# rate (by sex i) among unexposed 

for (i in 1:2) {SRR.div[i] <- w[i]*pi.case[i, 1]/pi.controlļi, 1] 

# rates by exposure j and sex 

for (j in 2:4) {SRR.top[j, i] <- wļ[i]*pi.caseļi, j]/pi.controlļi, j]} 

for (j in 2:4) {SRR[j] <- sum(SRR.top[j, 1:2)/sum(SRR.div[1:2)} 

Data 

list(alpha=c(1,1,1,1), # total male and female cases 

N=c(649, 60), # total male and female controls 
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Table 10.7 Lung cancer cases and controls by exposure 
(Daily Smoking Levels) 


Cigarettes smoked daily Cases Controls 
Males 0 2 27 
1-4 33 55 
5-14 250 293 
15 364 274 
All levels 649 649 
Females 0 19 32 
1-4 7 12 
5-14 19 10 
15 15 6 
All levels 60 60 


original variable is inherently ordinal, and not only if the categorisation derives from an 
originally continuous risk factor. 

Suppose estimates have been produced by a procedure such as above or by a 
categorical regression (e.g. a logit model with disease outcome in relation to the risk 
variable in category form and with a corner constraint). Suppose the estimated odds or 
incidence ratios attached to category r derive from a population of size N,, or person- 
years total T,, then a weighted regression, linear or non-linear, involving the R — 1 effect 
estimates as dependent variable, is one way to model the trend (Rothman, 1986, p. 337). 
To model the trend, it is necessary to assign scores to each of the R categories. If the 
categorisation involved aggregating over originally continuous data, the average or 
median of a continuous exposure variable within each category might be used. For 
ordinal data it is common to assign uniformly incremented scores, though modelling of 
the cut-points on the underlying scale (see Chapter 3) might be a preliminary step. 

It may be noted that a guide to the extent of non-linear impacts of the risk factor is 
provided by comparing fit measures between a categorical regression in that risk factor 
and a model with a linear trend in the corresponding category scores (Woodward, 
1999). If the categorical regression does not improve markedly in fit over the linear 
scores model, this might be a preliminary to a linear effect model in the original (i.e. 
uncategorised) form of the exposure variable. 

Greenland (1995) advocates spline regression to model dose-response effects. This 
still involves assigning breaks in the risk variable but allows more effectively for a 
changing gradient between disease rate and exposure, as the exposure varies over its 
range. This may be beneficial if there is non-linearity in the dose-response curve 
(Boucher et al., 1998). There are, however, many other approaches to non-parametric 
regression which a Bayesian estimation approach may facilitate (Fahrmeir and Lang, 
2001); efficient sampling for such models may be obtained using the BayesX software 
(http://www.stat.uni-muenchen.de/ lang/bayesx/bayesx.html). 


M=c(649, 60), TM=709, 

# male cases (2, 33, etc) then female cases by exposure (0 cigarettes, 1-4 cigarettes daily, 5-14, 15 ) 
a=structure(.Data=c(2, 33, 250, 364, 19, 7, 19, 15),.Dim=c(2, 4)), 

# male controls (27, 55, etc) then female cases by exposure (0 cigarettes, 1-4 cigarettes daily, 5-14, 15) 
b=structure(.Data=c(27, 55, 293, 274, 32, 12, 10, 6),.Dim=c(2, 4)) 
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10.3.1 Clustering effects and other methodological issues 


Assuming that a categorical breakdown of the exposure variable or dose has been 
adopted, a binomial or Poisson analysis of responses Y, from N, units exposed is a 
standard one in assessing the strength of the risk as dosages X, are varied. However, 
under certain types of sampling, especially if there is nesting of subjects, there may be 
excess variability relative to the binomial. (Similar considerations apply if there are 
multinomial outcomes at each dosage, such as say, no increase in morbidity, some 
increase, or mortality.) 

An example is intra-litter correlation in experiments in developmental toxicology. 
Thus, let there be R levels of toxic exposure X, and j = 1,.. M, litters of size N,; at each 
exposure. Then a dose response analysis of a binary outcome (survival or death) may 
focuses on the outcomes 


Y, ~ Bin(a,;, Ny) 
under a model for the proportions responding such as 
Tj =[1  exp(—Bo — B1X 1! (10.3) 


Tests of dose effect involving the coefficient B; may, however, be affected by the level 
of intra-litter correlation, p(X,.) as dosage changes. Instead suppose the response pro- 
portions are 


Tr = T/r oy] (10.4) 
where 
T, = [X] = exp (Yo yi 47) (10.5) 
w, = w[X;] = exp (èo 81 X;) (10.6) 
and the intra-litter correlation is 
AX)J= a oJ)? (10.7) 


This framework is consistent with a beta-binomial density with possibly overdispersed 
variance relative to the binomial, namely 


Var( Yyl N) = me a Tj)/Ny[l (N; _ 1)/d T ,)] 


Clustering reflects unobserved heterogeneity (frailty) for groups of related subjects. 
In dose-response models involving human subjects, especially in trials involving self- 
administration of treatment drug or placebo, outcomes may also be subject to another 
generally latent influence, namely subject compliance with treatment(s). Variations in 
compliance mean that actual exposure varies in an unknown way (Efron and Feldman, 
1991; Zeger and Liang, 1991). Allowing for compliance amounts to modelling the true 
(and latent) exposure to a treatment on the basis of manifest variables (Dunn, 1999). 
The manifest variables would typically include both measures of compliance (e.g. bio- 
markers for drug absorption) and observed responses to treatment. This leads to 
structural equation models in which the latent exposure underlies both the observed 
outcomes and bio-markers. In clinical trials where there is a placebo group, certain 
additional assumptions may be reasonably made which facilitate analysis. For example, 
if treatment allocation is random, one may sometimes assume that the propensity to 
comply is similar between treated and control groups — even if the placebo group are not 
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exposed to an active drug or treatment. On the other hand, there may be circumstances 
when compliance in the treatment group is related to prognosis (e.g. subjects with 
adverse symptoms may be more likely to comply) (Mark and Robins, 1993). 

Dose-response relations may involve dependencies on both time and the dose itself, 
for example in bioassay where the analysis considers both the level of the mortality rate 
itself and the patterns to times to mortality. In animal experiments, a proportion of the 
animals may not be subject to experimental mortality at all, and go on to live their 
natural life. One might consider a mixture analysis of susceptibles who will die prema- 
turely in the experiment and non-susceptibles (Pack and Morgan, 1990). But the 
analysis may be reduced to modelling the times to mortality of susceptible animals. 
Often a proportional hazard analysis (see Chapter 9) involving multiplicative functions 
of dose and time may be applied. Similar issues may occur in human disease onset or 
mortality, in terms of the joint impact of age at death or onset and a putative exposure 
such as smoking. 

Dose-response relations may also be modified via multi-level approaches in order to 
take account of subject level covariates — and this might be one way to model concomi- 
tant time/age effects (Greenland, 1998b; Wiesinha and Piantadosi, 1995). Consider the 
logistic model in Equation (10.3). Both Bọ and B; may be made functions of subject 
covariates v,(j = 1,..M,) for instance, via 


Bo, = exp (ovj) 
By, = exp (Ài vyj). 


Example 10.5 Breast cancer and radiation Rothman (1986, Chapter 16) and Green- 
land (1995) are among those arguing against assessment of dose-response relations 
using categorical trend analysis. This might, for instance, involve selecting quintile or 
quartile breaks in a continuous risk factor X and then modelling trends in terms of 
category specific rates or odds ratios. The implicit assumption is that risk does not 
change within categories, whereas in fact there is often a trend within the category. One 
might ensure constancy of risk for all practical purposes by taking a larger number of 
categories than conventionally used (e.g. 10 or 20), but then risk estimates for each 
category tend to be unstable — though Bayesian smoothing methods may well attenuate 
this drawback. Alternatives are spline regression (see Example 10.6) which still involves 
selecting breaks in the continuous X, but allows for trends within categories, or regres- 
sion with category scores. For instance, one might take as category scores the average or 
median exposure X, within the rth category and then apply the usual regression 
methods — linear or power models, with a link function as appropriate. 


To illustrate the role of scoring categories of a dose variable, we consider data from 
McGregor et al. (1977) on incidence of breast cancer among women from Nagasaki or 
Hiroshima, and aged 10 in 1945, according to four categories of radiation dose (Table 
10.8). Following Rothman (1986) we can assign mid-category scores 0, 5, 55, and 150 
and use as regression weights the total of person-years exposed T,. If available, median 
or mean exposures within the three upper categories would be preferable options (in 
fact, a mean weighted for person years of exposure would be required). 

The analysis is in terms of rates per 1000 Person-Years (PY). Following Rothman, 
this outcome may be modelled as a continuous score with Normal errors. Note that a 
more natural unit is per million person-years but this implies a very small precision and 
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Table 10.8 Radiation exposure and cancer cases 


Exposure* 0 >0- < 10 10— < 100 100 and over 


Breast cancer cases 38 105 48 34 
Peson-years exposed 208515 463086 164639 52185 
Incidence rate 


per 1000 person-years 0.182 0.227 0.292 0.652 


*> 0 —< 10 means over zero but under 10. 


possible prior specification problems. A weighted regression is used with precisions for 
the rth category modelled as 


P, = pr 


with t an overall precision parameter and T, the person year total for the rth category. 

Following Rothman, a linear model is initially assumed. The linear slope of 0.00269 
in the 1000 PY scale, with 95% interval from 0.0016 to 0.0038, compares to the 
confidence interval cited by Rothman (1986, p. 337) of 1.23 to 3.63 in the 1000000 
PY scale. The predicted mean incidence for an exposure of 100 rads is 0.463 per 1000 
women-years, with interval from 0.37 to 0.56. 

An alternative model is quadratic in exposure as in Example 10.2, while still retaining 
the assumption of a continuous outcome. This yields an improved predictive loss 
criterion of 0.014 (with w = 1) as compared to one of 0.023 with the linear model. 
Note that there is a baseline or background mortality effect here (see Section 10.3.2), 
namely mortality at zero rads, so that prediction of mortality in the first category is not 
improved by power models. 

Another sampling model might take the four observed counts of cases as Poisson with 
means v, = aT,p,, with v, = 0.001 T,u, for an analysis in incidence per 1000 person- 
years. This option in fact suggests the quadratic model is over-parameterised. This 
model yields improved predictions (z.scaled[] in Program 10.5, Model C) of the inci- 
dence rates as compared to the linear model in a continuous outcome, namely 
{0.204, 0.212, 0.308, 0.634} but has less precise predictions of the new data, and so 
has a worse predictive loss. Arguably, though, this approach more appropriately reflects 
the uncertainty in the observed data. 


Example 10.6 Trend in CHD according to SBP, Framingham study The subject 
matter for this example is the CHD onset data for the Framingham cohort, but 
considering various options to improve dose-response modelling over the categorical 
predictor model in Example 10.2. 


The first expedient is a finer subdivision of the SBP variable. This sub-division has six 
levels: under 140, 140-149, 150-159, 160-169, 170-179, and over 180. To assess trend, 
scores are then assigned using the observed mean exposure within categories r = 1,..,6 
and using as weights the observed numbers N, in each category. The odds ratios for the 
six category breakdown are estimated as fixed effects via a logistic regression (Model Al 
in Program 10.6). Relative to the SBP under 140 category, they range from 1.5 for SBP 
140-149 (with 95% credible interval including values under 1) to 4.04 (Table 10.9). 

Subsequently, fitting a weighted linear trend to the six posterior mean odds ratios 
(Model A2 in Program 10.6) gives a slope of 0.04 for the increase in the log OR for every 
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Table 10.9 Odds ratios for successive SBP bands 


Odds Ratio for SBP Mean St. devn. 2.5% 97.5% 


140-149 1.50 0.31 0.95 2.17 
150-159 1.88 0.40 1.24 215 
160-169 2.37 0.61 1.38 3.74 
170-179 3.50 0.91 1.98 5.53 


180 4.04 0.88 2.64 6.01 


unit increase in SBP. For the threshold of 165 mm Hg systolic blood pressure, the odds 
ratio estimated from this trend model is 2.65 with 95% interval from 2.3 to 2.8. The 
central estimate is in fact very similar to the estimate from the continuous predictor 
logistic model (see Example 10.2), but the credible interval is slightly narrower. 

One might also compare the fit of a categorical model with a model linear in the 
scores attached to each category (Woodward, 1999). Model A3 is linear in the averages 
in each SBP band. Whereas the categorical regression in SBP has a DIC of 1277, this 
simpler model has a DIC of 1272 and suggests that non-linearity in the effect of SBP (on 
the logit of the incidence probability) may not be very marked. Note, though, that this 
analysis is conditioned on the break points chosen for the six levels. 

A different perspective on possible non-linearity (Models B and C) is supplied by 
spline and state-space regression methods. As argued by Greenland (1995), spline 
regression may avoid some of the problems associated with redefining a continuous 
exposure into discrete categories. A simple categorical factor approach via logistic 
regression implies (for example) an implausible jump in risk moving from an SBP of 
169 to one of 170. In a spline regression, by contrast, the within category lines have a 
non-zero slope, with smoothness also generated by using quadratic or cubic spline 
terms. Thus, a quadratic spline in SBP would involve linear and quadratic terms across 
the range of SBP, and spline terms operating only when category breaks (knot points) 
are exceeded. 

If Cx denotes the Ath break for SBP, then with notation for SBP, Age and sex as in 
Example 10.2, the spline terms Dx are zero if X < C, and have value Dy = (X — Cy if 
the threshold is exceeded. If there are K category breaks (e.g. K =5 in the above 
analysis for SBP with cut-points Ck being 140, 150, 160, 170 and 180), then up to 
five extra terms are potentially added in the regression. Spline terms in age are 
denoted E, = (Z; — B}? if the threshold B is exceeded and zero otherwise, so a full 
quadratic spline model in age with three cut-points (at ages 50, 55 and 60) would be 
represented as 


BZ, BZ? SF, &E 838; 


A spline model has drawbacks regarding the number and location of knot points. It may 
be wise to start off with a minimal number of knot points, and then add additional ones 
if the data support them. For illustration and to improve identifiability, the quadratic 
terms in Age and SBP across the range (terms such as 8,Z7) are omitted, and single 
knots (i.e. Kı = Ky = 1) adopted in SBP and Age, respectively (with Age and SBP also 
divided by 10 for numeric stability). The knots are at SBP=180 and age 55. The choice 
of these points was based on earlier analyses with more than one knot in both age and 
SBP — though there remains a degree of arbitrariness unless the location of the knot 
point(s) is assumed unknown and itself assigned a prior. 
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With this framework, the odds ratio comparing subjects with SBP above and below 
165 is based on the average SBP for those above and below the threshold. Here we 
consider males aged 50 in these SBP bands, and for males these averages are 185.3 and 
135.4. The fitted logit model for the probability 7 of CHD onset is then 


L=logit(n)=a BZ òE BX Di BZ (10.8) 


where Z; and X are scaled by 0.1. To obtain the relevant odds ratio, it is necessary to 
compare L for males aged 50 in the above and below 165 SBP groups. So Age is set to 50 
in (10.8), and SBP at either 185.3 or 135.4. A profile of age effects (for the 18 ages 45, 46, 
up to 62) at the average SBP of 148 is obtained as 


Expit(a BıZı ðE BX y,Di B32Z2) 


where Expit(L) = exp (L)/[1 exp (L)] is the inverse of the logit transform (Greenland, 
1998a). With start points provided by the 2.5th and 97.5th percentiles of a trial run, 
convergence in a two chain run is obtained at around 1500 iterations. The spline term in 
age is not significant, in line with no clear quadratic effect in age (Table 10.10). There is 
a significant spline term in SBP, and additional terms might be experimented with. The 
posterior median for the odds ratio of 3.02 exceeds those obtained earlier. The pseudo 
marginal likelihood improves over Model A (—628.5 vs. —639). 

The profile of CHD rates at various levels of SBP (for a male at average age 52.4) 
obtained from Model B shows a tailing off in the increased risks of CHD at very high 
SBP (Figure 10.2). This profile is stored in the vector SBP.eff[] in Program 10.6. Its 
shape may be an artifact of the data in that there happen to be no CHD cases among a 
small number (5) of subjects with SBP of 280 and over. 

A final form of analysis of nonlinear risk effects (Model C) is relatively straightfor- 
ward in terms of estimation through Bayesian sampling, and might serve as a basis for 
selecting knot points in a subsequent spline analysis. This model uses a form of state- 
space prior, and involves a large number of age and SBP categories, specifically 18 age 
categories and 43 SBP categories (in intervals of five on the original scale so that the 
groups are 90-95, 95-100, etc., up to 295-300). If B; denotes the age parameters and y; 
the SBP group parameters, then a random walk prior is assumed, such that 


B; ~ N(B;_1, Tg) j=2,..,18 
y~ NO- Ty) J=2...43 


To assist identification, it is assumed that B; = y, = 0. A two chain run shows conver- 
gence after 2000 iterations and the second half of a run of 5000 iterations produces 
gradients in the log-odds parameters for age and SBP groups (relative to the baselines, 


Table 10.10 Quadratic spline in age and SBP, parameter summary 


Mean St. devn. 2.5% Median 97.5% 


Odds ratio 3.03 0.47 2.23 3.02 4.09 
a —7.92 0.88 —9.60 —7.91 —6.13 
Bı 0.51 0.16 0.21 0.51 0.83 
B2 0.22 0.03 0.16 0.22 0.28 
B3 0.92 0.15 0.63 0.92 1:22 
òi —0.021 0.525 —1.039  —0.022 1.019 


yı —0.020 0.009 —0.038 —0.020 —0.004 
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Figure 10.2 SBP profile of CHD rate (at average age) 


age 45 and SBP of 90-95) as in Figures 10.3 and 10.4. This model improves over model 
A in terms of pseudo-marginal likelihood (—632.5 vs. —639), but has a broadly similar 
predictive loss measure (300.2 for w = 1 compared to 300.5 for Model A). Increasing w 
(e.g. to 1000) does, however, give a larger fit advantage to Model C — increasing w tends 
to downweight relatively imprecise predictions and emphasize the match between the 
actual data and the posterior means of the new data. 

The results above use G(1, 0.01) priors on 1/tg and 1/7t,, and results of this type of 
analysis may well be sensitive to the priors on these parameters — as they govern the 
degree of smoothing. The posterior means for 73° and 7° are 0.11 and 0.20. If, instead, 
G(5, 0.05) priors favouring relatively high precision are adopted, the posterior mean of 
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Age 


Figure 10.3 Age parameters 
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Figure 10.4 SBP parameters 


DA is reduced to 0.15, but the shape of the log-odds curve is very similar to Figure 10.4. 
From Figure 10.4, a final knot point at around 250 might be selected and the spline in 
the upper category restricted to be linear (see Greenland, 1998a) to avoid the implaus- 
ible fall in risk at very high SBP. 


Example 10.7 Cumulative mortality in relation to dose-time The previous example has 
illustrated how standard model assumptions (e.g. linearity in dose effects) may need to 
be critically examined. As mentioned above, dose-response modelling one may also 
need to consider the joint action of dose with other (confounding) factors, as well as 
departures from standard sampling assumptions (e.g. clustering as a source of binomial 
overdispersion). 


To illustrate an alternative modelling structure, drawing on survival analysis concepts 
to represent the joint effects of dose and a (confounding) time index, consider an animal 
experiment reported by Pack and Morgan (1990). This involves deaths over a 13 day 
period among flour beetles sprayed with the insecticide pyrethrins B, where the focus is not 
only on endpoint or total mortality by the end of the period, but on cumulative mortality 
at days 1, 2,3,.., up to 13. Four dosage levels (mg/cm?) were applied, 0.20, 0.32, 0.50 and 
0.80. The relation of mortality to dose-time is expected to be differentiated by the sex k of 
the beetle. It may be noted that the structure of this example could be applied with 
suitable modifications in Example 10.6 (with SBP as the dose, and age parallel to time). 

The probability of death at dose X, in the jth time interval {t)-1, tj}, where 
to = 0, t) = 1,..t)3 = 13 is modelled as 


ay = Clty, X,) — C(t, X) (10.9a) 
where 
1/C, X)=[1 exp{—B, — Blog (WH tg] (10.9b) 


Thus, separate dose-response and time-response models are present in Equation (10.9b). 
As an illustration of the potential benefits of parameter transformation, an alternative 
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model expresses By as exp (y4). Deaths are additionally classified by sex k, so this model 
involves parameters {B;1, Byo, By3, B4}, though selective equality of parameters might 
be investigated as one way to model simplification. As noted by Morgan (2000), the 
function in time in Equation (10.9b) is the cumulative density of a log-logistic, and for 
susceptible animals is the cumulative density of mortality over time, regardless of dose. 

Prior substantive knowledge suggests the parameter B, may be larger than the others, 
and so the multivariate Normal precision matrix allows a wider range in its value. 
Its prior mean is set at 20 for both males and females. In the alternative parameter- 
isation the y,4 have prior means of 3, and variances 10. In the original parameterisation, 
convergence with a three chain model is not obtained after 150000 iterations, with the 
parameter B,, failing to converge. By contrast, in the alternative parameterisation 
convergence occurs in all parameters by 20000 iterations and the summary in Table 
10.10 is based on the iterations 20 000-30 000. 

The deaths data for higher day numbers are rather sparse, and asymptotic consider- 
ations applied in classical tests and often in simple Bayes assessments of fit (e.g. via AIC 
and BIC criteria) will be of some doubt (Pack and Morgan, 1990, p. 752). Therefore, 
bootstrap principles and/or predictive probability checks might be applied to assess fit. 
Accordingly we find the likelihoods of actual and replicate data to be closely compar- 
able, and a predictive check to average about 0.34. A similar conclusion is reached by 
Pack and Morgan (1990), using Monte Carlo testing applied to the binomial deviance. 
They obtain a deviance of 92 for the model (10.9), and a range of deviances in 100 
replicate data sets (sampled from the ML estimate and fitted with the same model) from 
76 to 130, with a mean of 90 and variance of 150. They conclude that the fitted model is 
consistent with the simulated data. 

Values of the dose and time profile parameters by sex of flour beetle, as in Table 
10.11, are similar to those presented by Morgan (2000, p. 89). Note that the parameter 
Bi2 has an asymmetric posterior density, and with a flatter prior than the one used in 
Program 10.7 may become unstable. Morgan (2000) obtains a maximum likelihood 
estimate for Bı) of 3.37 with standard error 0.33, and a 95% interval (assumed to be 
symmetric) from 2.72-4.02. It can be seen that a Bayesian analysis has an advantage in 
representing parameters that in fact have asymmetric densities. 


Example 10.8 Boric exposure Examples 10.6 and 10.7 have considered Bernoulli/ 
binomial sampling without allowing for overdispersion due to clustering or frailty 


Table 10.11 Flour beetle mortality, parameter summary 


Males Mean St. devn. 2.5% Median 97.5% 


Bii 4.99 1.36 3.26 4.82 8.86 
Biz 3.59 0.88 2.44 3.48 6.04 
Bi3 2.71 0.20 2.34 2.71 3.09 
V4 2.70 0.23 2.27 2.69 3.15 
Females 

Boy 2.83 0.81 1.39 2.75 4.66 
Boo 2.75 0.59 1.67 2.71 4.04 
B23 3.45 0.30 2.88 3.45 4.06 


Yoa 4.00 0.38 3.30 4.00 4.77 
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Table 10.12 Boric acid exposure and in-utero damage 


Mean St. devn. 2.5% Median 97.5% 


So 4.12 0.73 2.77 4.09 5.58 
òi —7.67 1.92 —11.18 —7.56 —3.80 
Yo 1.34 0.72 —0.05 1.31 2.80 
yı —4.74 2.08 —8.93 —4.66 —0.70 
Ti 0.058 0.010 0.040 0.058 0.080 
T2 0.074 0.009 0.058 0.074 0.095 
T3 0.094 0.011 0.075 0.093 0.116 
T4 0.130 0.026 0.083 0.128 0.186 
Pi 0.019 0.014 0.004 0.016 0.056 
P2 0.036 0.020 0.010 0.032 0.085 
P3 0.068 0.028 0.028 0.063 0.136 
P4 0.231 0.065 0.121 0.227 0.374 


effects. One might, for instance, in Example 10.6, introduce subject level random effects 
to represent unmeasured influences on CHD onset, though this would considerably 
slow estimation. To illustrate clustering as a source of binomial overdispersion and 
another source of departure from standard modelling assumptions, consider an experi- 
ment to assess in-utero damage to mice following exposure to boric acid (Slaton et al., 
2000). This is a constituent of many household products with suspected risks to humans. 
The exposures in terms of percent of acid in the mice feed were X, = 0 (control), 
X = 0.1, X = 0.2 and X4 = 0.4, and the numbers of litters at each dose level were 
Mı = M2 = M; = 27 and M4 = 26. The outcome is the number dead Y,; among litters 
of size N,;, with j = 1,.. M, and r = 1,..,4. 


A three chain run with over-relaxation shows convergence at round 30 000 iterations 
in the four parameters {Bo, By, Yo. Yı} in Equations (10.5)-(10.6). This gives estimates 
of these parameters, and of the intra-litter correlations, as in Equation (10.7), based on 
iterations 30 000-40 000 (Table 10.12). 

These show the correlation as most pronounced at the highest dosage d4, and cursory 
examination shows the high death rates within certain litters at this dose level. By 
comparison, the maximum likelihood and their standard errors as estimates obtained 
by Slaton et al. (2000) were 


Yo = 1.54(1.10) 
yı = —5.19(3.07) 
So = 4.33(1.10) 
8, = —8.26(2.99) 


The beta-binomial mean proportion is 
T(X,) = t/t, | =[(80 yo) exp — XT! 


so that the standard logit dose-response model is, from Equation (10.3), equivalent to 
letting Yọ — 59 = Bo and yı — 8; = Bı. An absence of linear effect of dose in the mean 
proportions is then equivalent to 6; = yı. One might therefore apply (in a sampling 
framework) a test of whether y; > ò, with a posterior probability over 0.95 or under 
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0.05 being broadly equivalent to rejecting y; = ô. A high probability that y; > 8; is 
consistent with an increasing dose-response. 

Also, when t(X,) and w(X,) are exponential functions of dose, the intra-litter correl- 
ation is 


pAXJ=(1 exp(yo YX) expo 8X)! 


so an absence of dose-response effect in both correlation and mean only occurs if 
yı = 6 = 0. If this happens then pLX,] = p and the intra-litter correlation is constant. 
The hypothesis of constant correlation might be assessed by monitoring whether 
pLX,] > pLX;] over pairs r,s. A rejection of p[X,]=p also amounts to rejecting 
p[X,] = 0, which is the condition required for the standard binomial sampling model 
to be applicable. 

Both inequalities y; > 6; and p[X,] > pLX;] over pairs r >s were confirmed with 
probability 1 (i.e. no samples were exceptions to these inequalities over iterations 
30 000-40 000). Hence, there is both a dose effect and extra-binomial variation. One 
might assess these features via model fit criteria. Thus, the predictive loss criterion of 
footnote 3 is 267 under the binomial (with w = 1), but considerably lower at 174 under 
the clustered binomial. The latter model provides a much improved fit of deaths at the 
highest dose, for instance of Yg2 = 12, with posterior mean vg) = 9.8 under the clustered 
model against vg) = 2.2 under the binomial. 

Slaton et al. point out a high correlation between y; and 8; and suggest an alternative 
parameterisation involving the parameters y; and B; = y; — ô. Adopting this here 
(Model C in Program 10.8) shows faster convergence (around 15000 iterations with 
over-relaxation) and the correlation between y; and B; is only —0.07. 


Example 10.9 Compliance and response An illustration of dose-response modelling 
approaches where compliance is an issue is provided by simulated data from Dunn 
(1999). Here n = 1000 subjects are randomly divided in a 50:50 ratio between control 
and treated groups, with the outcome Y being a function of a latent true exposure F. 
The treated group has a higher coefficient on the true exposure than the control group 
in this simulation. Two fallible indicators C1, C2, of the compliance (e.g. bio-markers 
for active or placebo drugs) latent exposure are available. The first of these provides a 
scale for the unknown exposure F that is taken be centred at u. The second has 
coefficient y = 1 on F. The observed outcome Y is also related to the latent exposure, 
with the impact of F on Y allowed to differ according to assignment to treatment or 
otherwise. 


Specifically, the simulated data is generated according to 


Yi=a Bafi Ni 


Cy =F; ey 

(10.10) 
Cy = YF; ej 
F= h. Uj 


where p = 70, a = 50, B, = 4 for treated subjects (G; = 2) and B, = 1 for control 
subjects (G; = 1). The variances of the normally distributed errors N;, e1;, ez; and uj 
are, respectively, tT, = 225, T1 = 144, t2 = 225 and T, = 225 and their means are zero. 
The model is re-estimated knowing only Y, Ci, Cy and G. Both in setting priors on 
precisions and intercepts, and in sampling inverse likelihoods (to estimate CPOs), it is 
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preferable to scale the data by dividing Y, Cı and C, by 100. Otherwise, the variances 
are large and their estimation sensitive to prior assumptions. Initially, the same dose- 
response model as in Equation (10.10) is assumed, except that in contrast to the 
generating model, differential variances T,; and 1,2 of the errors n; are adopted, 
according to patient treatment group, so that 


Yi=a Bafi Nia, 


G(1, 0.001) priors on bp; = 1/tj, 6; = 1/7; and , = 1/7, are adopted. A three chain 
run with over-relaxation shows convergence at around 500 iterations, and the summary 
in Table 10.13 is based on iterations 1000-5000. The original parameters are reasonably 
accurately reproduced, when account is taken of the scaling. 

In a second model, the C; are taken as centred at v; and F to have zero mean and 
variance 1. This approach might be one among several options adopted in ignorance of the 
generating model. Because the variance of F is known, slopes of Cı and C, on F may be 
estimated. Additionally, the intercept of Y is taken as differentiated by treatment. Thus 


Yi=0G, Bati Nii, G; 
Ci=v YF; eli 
Cy =v YFi eri 


This model has a considerably improved predictive loss criterion (0.32 vs. 0.54 for the 
first model, when w = 1), but slower convergence. The treatment differential on the 
effect of the latent exposure is still apparent, with B, over four times that of B4. 


Table 10.13 Compliance and latent exposure: parameter 
summary 


Mean St. devn. 2.5% Median 97.5% 


lst model 

a 0.55 0.05 0.44 0.55 0.64 
Bi 0.94 0.07 0.80 0.94 1.08 
B2 3.94 0.07 3.79 3.93 4.09 
y 0.995 0.009 0.978 0.995 1.012 
u 0.701 0.006 0.689 0.701 0.713 
TI 0.015 0.001 0.013 0.015 0.017 
T2 0.024 0.001 0.021 0.023 0.026 
Tri 0.021 0.002 0.017 0.021 0.025 
TM 0.025 0.013 0.002 0.025 0.049 
Tu 0.022 0.001 0.019 0.022 0.025 
2nd model 

ar 1.21 0.02 1.19 1.21 1.22 
a 3.30 0.04 3.26 3.30 3.35 
Bi 0.134 0.015 0.116 0.134 0.152 
Bo 0.603 0.027 0.559 0.604 0.644 
yı 0.139 0.011 0.128 0.139 0.150 
Yo 0.155 0.011 0.142 0.154 0.168 
vı 0.702 0.012 0.691 0.703 0.714 


v2 0.696 0.013 0.682 0.696 0.710 
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10.3.2 Background mortality 


A standard modelling assumption in controlled trials and laboratory experiments is that 
the responses of test subjects are due exclusively to the applied stimulus. In dose- 
response models this assumption means that the control probability (i.e. for subjects 
with no dose) of response is zero. However, multinomial or binary responses (e.g. for 
type of defect or for mortality) for such trials, where one or more intervention or 
treatment has been performed, may be subject to a background mortality effect. Such 
nonzero control response may need to be allowed for in dose-response studies. 

At the simplest, consider a binary mortality or other response Y with the probability 
that Y = 1 modified to take account both of the chance of a background event and the 
chance of a dose-induced event. Thus, let a and P(X) denote the respective chances of a 
background and treatment induced response, with corresponding random variables 


Yg ~ Bern(a) 
and 
Yu ~ Bern(P(X)} 
Then the overall probability that Y = 1 given a dosage X is a binary mixture 
Pr(Y =1|X)=a (l—a)P(X) 
This model has the effect of concentrating the dose-response curve modelled by P(X) 


from (0, 1) into the range (a, 1). If Y is polytomous without ordering, or ordinal, and 
contains S_ 1 categories, then Yg and Yy are multinomial, with 


Pr(Yg = S)= a,(s = 0, 1,..S) 
s 
where X` a, = | and 
0 


Pr(Yy >s|X)=H(x; BX) s=1,..8 


10.11 
=1 s=0 ( ) 


where H is an inverse link and the dose effect is linear for the assumed link. This defines 
a proportional odds model for Yọ with cut points K, that are monotonically declining. 


Example 10.10 Arbovirus injection This example involves the ordinal response data 
on deformity or mortality in chick embryos as a result of arbovirus injection (Xie and 
Simpson, 1999). Two viruses, Tinaroo and Facey’s Paddock, were investigated, with 72 
and 75 embryos, respectively, receiving these viruses. A further 18 embryos received no 
virus. 


There are S 1 =3 outcomes: survival without deformity, survival with deformity, 
and death. There is one death (i.e. background mortality) among the controls. For the 
g = 1, 2 treatments (Tinaroo, Facey’s Paddock), the probabilities of the responses may 
be expressed 


Pr (Ym > s| X) = A (kgs BX) s=1,..8 
=1 s=0 


For the Tinaroo group, there were four dosage levels (in inoculum titre in terms of PFU/ 
egg), namely 3, 20, 2400 and 88000. For the Facey’s Paddock group the doses were 3, 
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18, 30 and 90. These doses are subject to a logl0 transform. We adopt the previous 
paramaterisation of the proportional odds model (see Chapter 3), with appropriate 
constraints on the kgs. 

Follwing Xie and Simpson, the baseline mortality effect for the control group is taken 
to be binary rather than multinomial (excluding the option of survival without treat- 
ment induced deformity), and so only one parameter a is required. Note also that, to use 
the predictive loss criterion (footnote 3), it is preferable to use multinomial sampling 
where the data are dummy indicators y; = 1 if Y; = j and yip = 0, k # j — as opposed to 
direct categorical sampling using Y as the data and the dcat() function. The two are 
equivalent ways of modelling the data. 

N(O, 100) priors are adopted on the Kgs, N(0, 10) priors on the B, parameters’, and 
a B(1, 1) prior on a. A three chain run then shows convergence at 2500 iterations and 
the summary (Table 10.14) is based on iterations 2500-10000. The mean posterior 
probability of background embryo mortality (from natural causes) stands at 0.13 
compared to an estimate of 0.11 obtained by Xie and Simpson. The mortality rate is 
higher in the Tinaroo group as dosage increases and there are few surviving with 
deformity, whereas the Facey’s Paddock group have relatively more embryos surviving, 
albeit with deformity. Accordingly, the B dose effect parameter is stronger for Tinaroo 
embryos and there is only a small difference in cut points k2; and k22 comparing the 
combined response of survival with deformity and death and the death response 
considered singly. 

A second model introduces nonlinear effects in dose (adding a term in 1/X), so that 


Pr(Yu > sld) = H(kgs BX Y/X) s=1,..8 
= 1 s=0 


Note that as X increases 1/X declines so a negative effect on 1/X is equivalent to X 
increasing risk. An N(0, 10) prior on the y parameters is adopted. The analysis produces 
a negative effect on 1/X only for the first group, with mean (and standard deviations) on 
Bı and y; being 2.3 (0.8) and —2.3 (1.1), respectively. For the second group, the 
coefficient on 1/X is positive. This model produces no improvement in the predictive 
loss criterion (with w = 1) over the linear dose model, namely 85.4 as against 84.2, 
although both B, and y, coefficients are significant. 


Table 10.14 Arbovirus injection and chick embryo damage: 
parameter summary 


Mean St. devn. 2.5% Median ’97.5% 


a 0.13 0.05 0.05 0.13 0.23 
Bi 2.27 0.66 1.28 2.14 3.88 
B2 3.48 1.19 1.69 3.26 6.04 
Kil —6.21 1.92 —10.98 —5.85 —3.35 
K12 —11.04 3.38 —19.31 —10.35 —6.11 
K21 —5.08 1.83 —9.02 —4.74 —2.32 


K22 —5.37 1.85 —9.35 —5.03 —2.56 


7 A more diffuse N(O, 100) prior on the B, led to convergence problems. Moderately informative priors may 
be justified in terms of likely bounds on relative mortality between treatments or between treatment and 
baseline mortality. 
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10.4 META-ANALYSIS: ESTABLISHING CONSISTENT ASSOCIATIONS 


Meta-analysis refers to methods for combining the results of independent studies into 
effectiveness of medical treatments, or into the impact of environmental or other health 
risks, and so form a prior evidence base for planning new studies or interventions 
(Hedges and Olkin, 1985). While randomised trials are the gold standard evidence for 
meta-analysis (e.g. on medical treatment effectiveness), meta-analysis may use other 
study designs, such as cohort and case control studies. The typical Bayesian approach 
aims at estimating underlying ‘true’ treatment or study effects, defined by random 
deviations from the average effect. If observations on each study include an outcome 
rate for a control and treatment group, then one may also model the average risk level 
or frailty of subjects in each trial. 

Several possible outcomes may be considered as summarising study or trial results: 
examples are differences in proportions responding between treatment and control 
groups, the ratio of odds responding, or the ratio of proportions responding. With 
regard to previous sections, one might also pool the slopes of dose-response curves 
(DuMouchel and Harris, 1983) or odds ratios after allowing for confounders. DuMou- 
chel (1996) presents an example of combining odds ratios from different studies, 
where studies differ in whether their odds ratio estimate controls for confounders. 
Whether or not the ith study did control for a given confounder defines a set of binary 
covariates that influence the estimates of underlying study effects in the meta-analysis 
over studies. 

Bayesian methods may have advantages in handling issues which occur in meta- 
analysis, such as choice between fixed-effects vs. random-effects models, robust infer- 
ence methods for assessing small studies or non Gaussian effects, and differences in 
underlying average patient risk between trials. Further questions which a Bayesian 
method may be relevant include adjusting a meta-analysis for publication bias, meta- 
analysis of multiple treatment studies, and inclusion of covariates (Smith et al., 1995; 
Carlin, 1992; DuMouchel, 1990; Prevost et al., 2000). Thus, whereas most medical meta- 
analyses involve two treatment groups (or treatment vs. control), Bayesian techniques 
can be used to compare either of the two main treatments with a common third 
treatment to improve estimation of the main treatment comparison (e.g. Hasselblad, 
1998; Higgins and Whitehead, 1996). Publication bias occurs if studies or trials for 
meta-analysis are based solely on a published literature review, so that there may be a 
bias towards studies that fit existing knowledge, or are statistically significant. 

The simplest meta-analysis model is when effect measures y;, such as odds ratios for 
mortality or differences in survival rates for new as against old treatment, are available 
for a set of studies, together with estimated standard error s; of the effect measure. For 
example, consider the log odds ratio as an effect measure. If deaths a; and b; are 
observed among sample numbers r; and t; under new and old treatments, then the 
odds ratio is 


{ai/ (ri — ai)} /4bi/(ti — bD) 


The log of this ratio may (for moderate sample sizes) be taken as approximately normal, 
with variance given by 


s=1fa 1/fiza) 1/bi 1/(t— bi) (10.12) 


Under a fixed effects model, data of this form may be modelled as 
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yi ~ N(u, 8?) 


where u might be estimated by a weighted average of the y; and the inverses of the s? 
used as weights (since they are approximate precisions). Under a random effects model 
by contrast, the results of different trials are often still taken as approximately Normal, 
but the underlying mean may differ between trials, so that 


yi ~ Ni, 57) (10.13) 


where v; = » ò; and the deviations ô; from the overall mean p, representing random 
variability between studies, have their own density. For example, if the y; are empirical 
log odds, then p is the underlying population log odds and the deviations around it 
might have prior density 


3; ~ N(O, 7’) 


The rationale for random effects approaches is that at least some of the variability in 
effects between studies is due to differences in study design, different measurement of 
exposures, or differences in the quality of the study (e.g. rates of attrition). These mean 
that the observed effects, or smoothed versions of them are randomly distributed 
around an underlying population mean. 

We may make the underlying trial means functions of covariates such as design 
features, so that 


For instance, as mentioned above, DuMouchel (1996) considers odds ratios y; from nine 
studies on the effects of indoor air pollution on child respiratory illness. These odds 
ratios were derived within each study from logistic regressions, either relating illness to 
thresholds of measured NO, concentration in the home, or relating illness to surrogates 
for high NO; (such as a gas stove). Thus, four of the nine studies actually measured NO, 
in the home as the basis for the odds ratio. In deriving the odds ratio, two of the nine 
studies adjusted for parental smoking, and five of the nine for the child’s gender. Thus, 
in the subsequent meta-analysis, we can derive dummy indicators z; for each study 
which describe the ‘regression design’, or confounders allowed for, in deriving the odds 
ratio. 


10.4.1 Priors for study variability 


Deriving an appropriate prior for the smoothing variance 1? may be problematic as flat 
priors may oversmooth — that is, the true means v; are smoothed towards the global 
average to such an extent that the model approximates the fixed effects model. While 
not truly Bayesian, there are arguments to consider the actual variability in study effects 
as the basis for a sensible prior. Thus DuMouchel (1996, p. 109, Equation (5)) proposes 
a Pareto or log-logistic density 


a(t) = 5/(so T (10.14) 


where s = n/¥s;* is the harmonic mean of the empirical estimates of variance in the n 
studies. This prior is proper but highly dispersed, since though the median of the density 
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is So, its mean is infinity. The (1, 25, 75, 99) percentiles of t are 59/99, 50/3, 350, 9950. In 
BUGS the Pareto for a variable T is parameterised as 


TrackT © dD 


and to obtain the DuMouchel form involves setting a = 1, c = so, and then tT = T — sọ. 

Other options focus on the ratio B = 77/(t? 4) with a uniform prior one possibility. 
The smaller is 7? (and hence B), the closer the model approximates complete shrinkage 
to a common effect as in the classical fixed effects model. (This is obtained when 7? = 0.) 
Larger values of B (e.g. 0.8 or 0.9) might correspond to ‘sceptical priors’ in situations 
where exchangeability between studies, and hence the rationale for pooling under a 
meta-analysis, is in doubt. One might also set a prior directly on 7? directly without 
reference to the observed s?. For instance, one may take the prior tT? ~ x7(v)/v, with 
the degrees of freedom parameter at values v = 1, 2 or 3 being typical choices. For a 
meta-analysis involving a relatively large number of studies, or studies with precise 
effects based on large samples, a vague prior might be appropriate, e.g. 


+t? ~ G(0.001, 0.001) 


as in Smith et al. 

Smith et al. (1995, p. 2689) describe how a particular view of likely variation in an 
outcome, say odds ratios, might translate into a prior for 7”. If a ten-fold variation in 
odds ratios between studies is plausible, then the ratio of the 97.5th and 2.5th percentile 
of the odds ratios is 10, and the gap between the 97.5th and 2.5th percentiles for 8; 
(underlying log odds) is then 2.3. The prior mean for 7? is then 0.34, namely 
(0.5 x 2.3/1.96), and the prior mean for 1/7? is about 3. If a 20-fold variation in 
odds ratios is viewed as the upper possible variation in study results, then this is taken 
to define the 97.5th percentile of 7? itself, namely 0.58 = (0.5 x 3/1.96)*. From this the 
expected variability in 1? or 1/7? is obtained’. 


Example 10.11 Survival after CABG An example of the above random effects meta- 
analysis framework involves data from seven studies (Yusuf et al., 1994) comparing 
Coronary Artery Bypass Graft (CABG) and conventional medical therapy in terms of 
follow-up mortality within five years. Patients are classified not only by study, but by a 
three-fold risk classification (low, middle, high). So potentially there are 21 categories 
for which mortality odds ratios can be derived; in practice, only three studies included 
significant numbers of low risk patients, and an aggregate was formed of the remaining 
studies. 


Verdinelli et al. (1996) present odds ratios of mortality, and their confidence intervals 
for low risk patients in the four studies (where one is an aggregate of separate studies), 
namely” 2.92 (1.01, 8.45), 0.56 (0.21, 1.50), 1.64 (0.52, 5.14) and 0.54 (0.04, 7.09). The 
empirical log odds y; and their associated s; are then obtained by transforming the 


8 The upper percentile of 7? defines a 2.5th percentile for 1/1? of 1/0.58=1.72. A G(15, 5) prior for 1/7? has 
2.5th percentile of 1.68 and mean 3, and might be taken as a prior for 1/1”. If a hundredfold variation in odds 
ratios is viewed as the upper possible variation in study outcomes, a G(3, 1) prior is obtained similarly. 

° The standard deviations of the odds ratios would usually have been derived by considering numbers 
(ai, bi, Si ti) as in Equation (10.12) and exponentiating the 95% limits of the log-odds ratio. The original 
numbers are not, however, presented by Verdinelli et al. (1996). 


432 MODELLING AND ESTABLISHING CAUSAL RELATIONS 


above data on odds ratios and confidence limits. With a random effects model, a flat 
prior on the parameter 7? may lead to over-smoothing. To establish the appropriate 
degree of smoothing towards the overall effect u, we first adopt the (weakly) data based 
prior (10.14) previously suggested by DuMouchel (1996). 

A three-chain run for the low risk patient data shows early convergence. From 
iterations 5000-100 000 the estimated of the overall odds ratio in fact shows no clear 
benefit from CABG among the low risk patients (Table 10.15). The chance that the 
overall true effect is beneficial (i.e. that the pooled odds ratio u exceeds 1) is 0.699. The 
deviance information criterion for this model, which partly measures the appropriate- 
ness of the prior assumptions, is 11.35. 

A second analysis adopts a uniform prior on 77/(7” s). This leads to a posterior 
mean for the overall odds ratio of 1.40 with 95% credible interval {0.25, 3.24}. The 
DIC is slightly improved to 10.9. Finally, as in DuMouchel (1990), the prior 
tT? ~ x?(v)/v is taken with v = 3. This amounts to a 95% chance that 1? is between 
0.32 and 13.3. This yields a lower probability that the overall odds ratio exceeds 1, 
namely 0.6, but the posterior mean for the overall effect is slightly higher at 1.52, 
with 95% interval {0.29, 4.74}. The DIC is again 10.9. The posterior median of 7? is 
0.73. 

Note that a relatively vague prior such as tT? ~ G(0.001, 0.001) or t~? ~ G(1, 0.001) 
leads to an overall odds ratio estimate with very large variance and essentially no 
pooling of strength: under the latter, the posterior 95% intervals for the odds ratios 
{0.9, 7.57}, {0.23, 1.65}, {0.52, 4.77} and {0.07, 6.06} are very similar to the original 
data. The DIC under this option worsens to 11.6. 


Example 10.12 Thrombolytic agents after myocardial infarction An illustration of a 
meta-analysis where pooling of information is modified to take account of covariates is 
provided by mortality data from nine large placebo-control studies of thrombolytic 
agents after myocardial infarction, carried out between 1986 and 1993 (Schmid and 
Brown, 2000). Such covariates (if they have a clear effect on the trial outcome) mean the 
simple exchangeable model is no longer appropriate. In the thrombolytic studies, 
mortality rates were assessed at various times in hours ¢; between chest pain onset and 
treatment, ranging from around 45 minutes to 18 hours. 


The treatment effects y; are provided as percent risk reductions, 
100 — 10011); /1; 


where mı; is the treatment death rate and mm); is the control death rate (Table 10.16). 
Hence, positive values of y show benefit for thrombolytics. Schmid and Brown provide 


Table 10.15 CABG effects in lowest risk patient group 


Study Mean St. devn. 2.5% Median 97.5% 


1. VA 1.98 1.16 0.75 1.67 5.07 
2. EU 0.99 0.45 0.32 0.92 2.05 
3. CASS 1.53 0.77 0.59 1.36 3.50 
4. OTHERS 1.34 1.06 0.23 1.15 3.70 


Meta Analysis (Overall Effect) 1.41 1.23 0.45 1.25 3.20 
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confidence intervals for these effect measures, so that sampling variances s? can be 
derived. In fact, they assume a model with constant observation variance, 
; ~ NQi, o°) 
= aa (10.15) 
Pre N(p; T ) 


Alternate models for u; are a constant regression ignoring the time covariate, w; = Yo 
and a regression model 


Mi=Yo Yí 


where ¢; is as in the third column of Table 10.16. We also consider a constant regression 
model u; = Yo, in which the sampling variances are taken equal to their observed values, 
so that 


yi ~ NW, 57) (10.16) 


Consider first the model (10.15) with a common sampling variance. Here the observa- 
tions y; on the underlying p; are distorted by measurement error, and one may assume 
that 7? < o°, or equivalently 1/1? > 1/07. This is achieved introducing a parameter 
a ~ B(1, 1), and then dividing 1/0? by m, where 1/o? ~ G(1, 0.001). Under the empir- 
ical sampling variance model in Equation (10.16), a DuMouchel prior for 7 is taken. 

With a constant only regression, both models show early convergence in two chain 
runs, and inference is based on iterations 1000-20 000. The first option shows 7? around 
85, the second has 7? around 45. The underlying treatment effects accordingly vary more 
widely under Equation (10.15), namely between 7.4 and 30.6, whereas under Equation 
(10.16) they are between 17.6 and 28.5. The mean percent risk reduction yp is estimated 
as 19.5 under Equation (10.15) and 21.3 under Equation (10.16). The DIC is lower 
under model (10.16), namely 206.9 as against 209.7. 

Introducing the time covariate, together with the common sampling variance assump- 
tion in Equation (10.15), shows that longer time gaps between onset and treatment 
reduce the mortality improvement. The mean for y, is —1.2 with 95% interval 
{—2.3, —0.2}. Pooling towards the central effect is considerably lessened, and trial 
arms with longer time gaps (studies subsequent to ISIS-2 at 9.5 hours in Table 10.15) 
do not show a conclusive mortality benefit. Specifically, the 95% credible intervals for 
the corresponding v; include negative values, though the means are still positive. 
Adopting an alternative prior for the study effects 


vi ~ t(j, T) 


slightly enhances the contrasts in posterior means v;, but still only four studies show no 
mortality reduction. 


Example 10.13 Aspirin use: predictive cross-validation for meta analysis DuMouchel 
(1996) considers predictive cross-validation of meta-analysis to assess model adequacy 
(e.g. to test standard assumptions like Normal random effects). His meta-analysis 
examples include one involving six studies into aspirin use after heart attack, with the 
study effects y; being differences in percent mortality between aspirin and placebo 
groups. The data (in the first two columns of Table 10.17) include standard errors s; 
of the differences, and the pooled random effects model takes the precision of the ith 
study to be s7?. A Pareto-type prior for t, as in Equation (10.14), is based on the 
harmonic mean of the s?. The model is then 


Ic Lea £0 oET'O ocT'O 916 6I SLS vil CsI £661 SVYUAWA 

LE Ga S'6I 80T`0 L800 LCI cel VCCI 901 CsI 8861 c-SISI 

ET LIZ E'S T600 L80°0 SESTI 891 OLLI YSI gI £661 ALYT 
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Ver NW, s?) 
vi ~ N(p, 7’) 


It can be seen from Table 10.17 that one study (AMIS) is somewhat out of line with the 
others, and its inclusion may be doubted on grounds of comparability or exchangeabil- 
ity; this study may also cast into doubt a standard Normal density random effects meta- 
analysis. 

Such a standard meta-analysis using all six studies shows some degree of posterior 
uncertainty in t. A two chain run to 10000 iterations, with convergence by 1000 
iterations, shows a 95% interval for t ranging from 0.06—3.5. In five of the six studies 
the posterior standard deviation of v; is smaller than s;, but for the doubtful AMIS 
study this is not true — compare sd(v6) = 0.95 with the observed sẹ = 0.90 in Table 10.17. 
There is greater uncertainty about the true AMIS parameter than if it had not been 
pooled with the other studies. Despite the impact of this study the overall effect has 
posterior density concentrated on positive values, with the probability Pr (u > 0) being 
0.944. 

A cross-validatory approach to model assessment then involves study by study 
exclusion and considering criteria such as 


Uy = Pr(yt < yelyl— kD = [Pr Git < yl, yi- kPr(Oly[- kdo 


where y[ — k] is the data set omitting study k, namely {y1, V2, ..Yk-1, Yk 1,- - Yny. The 
quantity yë is the sampled value for the kth study when the estimation of the model 
parameters 6 = (v, T) is based on all studies but the Ath. Thus new values for the first study 
percent mortality difference are sampled when the likelihood for the cross-validation 
excludes that study and is based on all the other studies 2, 3,... If the model assump- 
tions are adequate, then the U; will be uniform over the interval (0, 1), and the quantities 


Table 10.17 Aspirin use: cross-validation assessment of meta-analysis 


Observed Data Cross validation 

Yi SI Predictive Predictive Predictive Predictive 

mean SD median probability 
UK1 2.77 1.65 1.03 1.97 0.96 0.782 
CDPA 2.5 1.31 0.97 1.94 0.86 0.779 
GAMS 1.84 2.34 1.24 1.96 1.17 0.590 
UK2 2.56 1.67 1.09 1.96 1.02 0.740 
PARIS 2.31 1.98 1.15 1.99 1.09 0.677 
AMIS —1.15 0.90 2.29 1.24 2.30 0.014 


Standard meta-analysis 


Mean St. devn. 2.5% Median 97.5% 
UKI1 vı 1.76 1.17 —0.21 1.67 4.34 
CDPA v 1.76 1.03 —0.07 1.68 3.97 
GAMS V3 1.40 1.29 —1.04 1.32 4.18 
UK2 V4 1.69 1.17 —0.33 1.60 4.23 
PARIS vı 1.56 1.21 —0.59 1.46 4.24 


AMIS v —0.11 0.95 —2.02 —0.08 1.65 
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Zr = O'(Uy) 


will be standard normal. A corresponding overall measure of adequacy is the Bonfer- 
roni statistic 


Q = Nmin(1 — |20; — 1|) 


which is an upper limit to the probability that the most extreme U; could be as large as 
was actually observed. 

One may also sample the predicted true study mean vě from the posterior density 
N (uj T based on excluding the kth study. This estimates the true mean for study 
k, had it not formed one of the pooled studies. 

Applying the cross-validation procedure (Program 10.13) shows that the Up for 
the AMIS study is in the lowest 2% tail of its predictive distribution (with predictive 
probability 1.4%). However, the Bonferroni statistic shows this may still be accept- 
able in terms of an extreme deviation among the studies, since Q = 0.17 (this is 
calculated from the posterior averages of the Up). There is clear evidence that the 
AMIS study true mean is lower than the others, but according to this procedure, it is 
not an outlier to such an extent as to invalidate the entire hierarchical meta-analysis 
model or its random error assumptions. The posterior means vě (the column headed 
predictive means in Table 10.17) show what the pooled mean p would look like in 
the absence of the kth study. The posterior mean vý for the AMIS study is about 2.29, 
with 95% interval 0.65 to 3.8, so that there is an unambiguous percent mortality 
reduction were this study not included in the pooling. 

One may also assess the standard meta-analysis against a mixture of Normals 


vi ~ N(tg, T) 


where the latent group G; is sampled from a probability vector m of length 2, itself 
assigned a Dirichlet prior with elements 1. With the constraint that w, > w4, this prior 
yields estimates m2 = 0.61 and a credible interval for m, that is entirely positive. The 
probability that G; is 2 exceeds 0.6, except for the AMIS study where it is only 0.26. In 
fact, this model has a lower DIC than the standard meta-analysis (around 25 as 
compared to 25.8). 


10.4.2 Heterogeneity in patient risk 


Apparent treatment effects may occur because trials are not exchangeable in terms 
of the risk level of patients in them. Thus, treatment benefit may differ according 
to whether patients in a particular study are relatively low or high risk. Suppose 
outcomes of trials are summarised by a mortality log odds (x;) for the control group 
in each trial, and by a similar log odds y; for the treatment group. A measure 
such as d; = y; — x; is typically used to assess whether the treatment was beneficial. 
Sometimes the death rate in the control group of a trial, or some transformation of 
it such as x;, is taken as a measure of the overall patient risk in that trial, and the 
benefits are regressed on x; to control for heterogeneity in risk. Thompson et al. 
(1997) show that such procedures induce biases due to inbuilt dependencies between 
di and xi. 

Suppose instead the underlying patient risk in trial i is denoted p; and the treatment 
benefits as v;, where these effects are independent. Assume also that the sampling errors s? 
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are equal across studies and across treatment and control arms of trials, so that 
var(x;) = var(y;) = o°. Then, assuming normal errors, one may specify the model 


Vi=Pi Vi Ui 
Xi =P; Mhi 


where u; and uz; are independent of one another, and of p; and v;. 

The risks p; may be taken as random with mean R and variance on Alternatively, 
Thompson et al. take o? as known (e.g. o? = 10 in their analysis of sclerotherapy trials), 
so that the p, are fixed effects. The v; may be distributed around an average treatment 
effect u, with variance 7’. 

Another approach attempts to model interdependence between risk and effects. For 
example, a linear dependence might involve 


VOS N(pi, 7) 
B= a B(p; R) 


and this is equivalent to assuming the v; and p; are bivariate Normal. 


Example 10.14 AMI and magnesium trials These issues are illustrated in the analysis 
by McIntosh (1996) of trials into the use of magnesium for treating acute myocardial 
infarction. For the nine trials considered, numbers of patients in the trial and control 
arms N; and Na vary considerably, with one trial containing a combined sample 
(Ni = Ni Nai) exceeding 50000, another containing under 50 (Table 10.18). 


It is necessary to allow for this wide variation in sampling precision for outcomes 
based on deaths r; and rą in each arm of each trial. McIntosh seeks to explain 
heterogeneity in treatment effects in terms of the control group mortality rates, 
Yn = Ma = rei / Nei. Treatment effects themselves are represented by the log odds ratio 


Yi, = log (my /Mci) 


To reflect sampling variation, McIntosh models the outcomes Yı and Y, as bivariate 
normal with unknown means 6;, 1:2 but known dispersion matrices };. The term oj); in 
X; for the variance of Y; is provided by the estimate 


1/{ Nami — my)} Namal — me)} 


while the variance for Yp is just the usual binomial variance. The covariance 04); is 
approximated as —1/N_;, and hence the ‘slope’ relating Y; to Yn in trial i is estimated as 
012;/022;. Table 10.18 presents the relevant inputs. Then the measurement model as- 
sumed by McIntosh is 


Y; 1:2 ~ N2(0i, 12, È) 


where 0; = v;, 82 = p;. One might consider a Multivariate ¢ to assess sensitivity. 
The true treatment effects v;, and true control group mortality rates, p;, are then 
modelled as 


vir N(u;; T°) 
p; ~ N(R, 2) 
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with p,; =a B(p;— R). If B is negative, this means that treatment effectiveness in- 
creases with the risk in the control group, whereas B = 0 means the treatment effect is 
not associated with the risk in the control group. The average underlying odds ratio for 
the treatment effect (controlling for the effect of risk) is obtained by exponentiating p. 

Inferences about B and p = exp (a) may be sensitive to the priors assumed for the 
variances 7” and o2. We consider three options for the inverse variances 1/7”, namely a 
G(3, 1) prior (see above) and a more diffuse G(1, 0.001) option. The prior on 1 /o% is 
kept at G(1, 0.001) throughout. The posterior estimate"? of B declines as the informa- 
tiveness of the prior on 1/7? increases, with the probability that B is positive being 
highest (around 29%) under the G(3, 1) prior, and lowest (3%) under G(1, 0.001) priors. 
Hence, only under diffuse priors on 1/77 is the treatment effect associated with the risk 
in the control group. The treatment odds ratio has a mean of around 0.62 with 95% 
interval {0.30, 1.13} under the G(3, 1) prior on 1/7? and 0.74 {0.44, 1.10} under the 
G(1, 0.001) priors. Taking a multivariate Student ¢ for Y;, 1:2 affects inferences relatively 
little, tending to reduce the chance of B being positive slightly; the degrees of freedom 
(with a uniform prior between 1 and 100) has a posterior mean of 51. 

An alternative analysis follows Thompson et al. in taking the observed r; and re; as 
binomial with rates my and ma in relation to trial populations N; and Nei. Thus 


ri ~ Bin(ttya, Ni) 
Fei ~ Bin(Tci, Nai) 


The models for y; = logit(m;) and x; = logit(m«;) are then 


Yi= P; Vi 
Xi = P; 


where the average trial risks p; may be taken as either fixed effects or random. Under the 
fixed effects model we take o? = 1, while under the random effects model it is assumed 
that 1 /o; ~ G(1, 0.001). The gain effects are modelled as above, 


vi~ Nw, 7) 
Bb; =a B(p; — R) 


Under the fixed effects option for p; and 1/1” ~ G(1, 0.001) we obtain a probability of 
around 17% that B exceeds zero. Under random effects for p;, inferences about B are 
sensitive to the prior assumed for 1/7”, as under the McIntosh model. Even for the more 
diffuse option, 1/7? ~ G(1, 0.001) there is a 12% chance that B > 0, while for 
1/1? ~ G(3, 1) the chance that B > 0 is 36%. It may be noted that the more informative 
prior is associated with a lower DIC. As above, neither prior gives an overall treatment 
odds ratio @ with 95% interval entirely below 1. 


10.4.3 Multiple treatments 


The usual assumption in carrying out a meta-analysis is that a single intervention or 
treatment is being evaluated. The studies are then all estimating the same parameter, 
comparing the intervention with its absence, such as an effect size (standardised differ- 
ence in means), relative risk or odds ratio. 


10 Two chain runs showed convergence at around 10000 iterations and summaries are based on iterations 
10 000-20 000. 
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However, in some contexts there may be a range of r treatment options, some 
studies comparing Treatment 1 to a control group, some studies comparing Treatment 
2 to a control group, and some studies involving multi-treatment comparisons (control 
group, Treatment 1, Treatment 2, etc.). One may wish to combine evidence over 
i=1,..n such studies, to assess the effectiveness of treatments j = 1,..r against no 
treatment (the placebo or control group is not considered a treatment), and to 
derive measures such as odds ratios comparing treatments j and k in terms of effective- 
ness. 


Example 10.15 MI prevention and smoking cessation Hasselblad (1998) considers an 
example of three studies for short-term prevention of heart attack (myocardial infarc- 
tion) using aspirin and heparin as possible alternative treatments. The outcome rates in 
the studies were five day MI rates, with only one study comparing r = 2 options with the 
placebo. Thus, the Theroux et al. (1988) study included no treatment (118 patients, of 
whom 14 had attacks): 


è treatment 1: aspirin (four out of 121 patients having an MJ); 
e treatment 2: heparin (one from 121 patients had an MI). 


The second study compared only aspirin with a placebo group, and the third study 
compared only heparin with a placebo group. There are then a total of A = 7 treatment 
or placebo arms over the three studies. 

Because of the small number of studies, a random effects model is not practical. 
Following Hasselblad, the observations from each arm of each study are modelled in 
terms of study risk effects p;, i= 1,..n (the log odds of MI in the control group), and 
treatment effects B;, j= 1,..r. The odds ratios OR; = exp (B; — Bx) then compare 
heparin and aspirin, while the odds ratios ; = exp (B;), j = 1,..r compare the treat- 
ments with the placebo. 

With N(0, 100) priors on all parameters, we find that the 6; are unambiguously below 
unity, so both heparin and aspirin can be taken to reduce short term MI mortality. The 
95% interval for odds ratio OR» just straddles unity (Table 10.19), and as Hasselblad 
(1998) says, is ‘more suggestive of a beneficial effect of heparin over aspirin than that 
from the Theroux study alone.’ 

A larger comparison by Hasselblad on similar principles involves 24 studies, 
evaluating smoking cessation programmes (and treatment success measured by odds 
ratio over 1). The control consisted of no contact, and there were three treatments: self- 
help programs, individual counselling, and group counselling. The majority of studies 
compare only one treatment with the placebo (e.g. individual counselling vs no contact) 
or two treatments (e.g. group vs. individual counselling), but two studies have three 
arms. One (Mothersill et al., 1988) compares the two counselling options with self-help. 
Hence, there are A = 50 binomial observations. 


Table 10.19 Multiple treatment comparison 


Mean St. devn. 2.5% Median 97.50% 


OR2 0.40 0.25 0.09 0.34 1.05 
o, 0.36 0.12 0.17 0.34 0.63 
b> 0.13 0.07 0.03 0.12 0.30 
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We assume a random study effect modelling cessation over all options including no 

contact, 
Pi ~ N(R, o°) 

where R is the grand cessation mean. This random variation is clearly present as shown 
by the credible interval (0.21, 0.78) on o?. The odds ratios , and ¢3 on the counselling 
options are clearly significant (Table 10.20), and that on self-help , suggests a benefit 
over no contact. The counselling options are in turn more effective than self-help. 

To assess prior sensitivity, a two group mixture on p; is adopted with 


Pi ~ N(R, oG) 


and G; denoting the latent group. This leads (Model C in Program 10.15) to low and 
high cessation rate studies being identified with posterior mean for R (the high 
cessation group) being —1.15 against R = —3.35. The treatment odds ratios are little 
changed however: ¢, and ¢; are slightly raised, OR» is slightly reduced. 


10.4.4 Publication bias 


The validity of meta-analysis rests on encompassing all existing studies to form an 
overall estimate of a treatment or exposure effect. A well known problem in this 
connection is publication bias, generally assumed to take the form of more significant 
findings being more likely to be published. Insignificant findings are, in this view, more 
likely to be relegated to the ‘file drawer’. One may attempt to model this selection 
process, and so give an indication of the bias in a standard meta-analysis based only on 
published studies. There is no best way to do this, and it may be advisable to average 
over various plausible models for publication bias. 

A common approach to this problem is to assume differential bias according to the 
significance of studies. Assume each study has an effect size Y; (e.g. log odds ratios or 
log relative risks) and known standard error sj, from which significance may be assessed. 
Then, studies in the most significant category using simple p tests (e.g. p between 0.0001 
and 0.025) have highest publication chances, those with slightly less significance (0.025 
to 0.10) have more moderate publication chances, and the lowest publication rates are 
for studies which are ‘insignificant’ (with p > 0.10). Hence, if a set of observed (i.e. 
published) studies has Nj, N2, and N3 studies in these three categories, and there are 
Mı, M and M; missing (unpublished) studies in these categories, then the true number 
of studies is {7}, Tə, T3} where Ti = Mı Ni, h = M2 No, and T; = M3 N3 and 
T3/N3 > T/N > Tı/Nı. 


Table 10.20 Smoking cessation: parameter summary 


Mean St. devn. 2.50% Median 97.50% 


OR2 1.69 0.22 1.30 1.68 2.16 
OR31 1.92 0.36 1.31 1.89 2313 
OR32 1.14 0.19 0.81 1.13 1.55 
R —2.42 0.14 —2.69 —2.42 —2.15 
On 1.29 0.16 1.01 1.28 1.62 
b> 2.16 0.12 1.92 2.15 2.41 
3 2.46 0.42 1.73 2.43 3.36 


o? 0.42 0.15 0.21 0.39 0.78 
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The objective is to estimate an overall effect u of exposure or treatment from the 
observed (i.e. published) effects. Suppose the Y, are log relative risks, so that u = 0 
corresponds to zero overall effect. For a Normal random effects model this is defined by 


Y; ~ N(vy, s>) 
v ~ N(p, 1°) 


where the v; model heterogeneity between studies. Given uninformative priors on 7” and 
u, the posterior density of u is essentially a normal density, with mean given by a 
weighted average of observed relative risks Y; and the prior relative risk of zero, with 
respective weights 


wy =1/s;/(\/s; 1/7] 


on Y; and 1 — w; on zero. To allow for publication bias, one may modify this scheme so 
that weights also depend upon significance ratios | Y;/s;|. 

Silliman (1997) proposes one scheme which in effect weights up the less significant 
studies so that they have a disproportionate influence on the final estimate of the 
treatment or exposure effect — one more in line with the distribution of the true number 
of studies. Suppose there are only two categories of study, those with higher significance 
probabilities, and those with lower significance. Then introduce two random numbers 
u; and u2, and assign weights W; = max (u1, u2) and Wz = min (u1, u2). This broadly 
corresponds to modelling the overall publication chance and the lesser chance attached 
to a less significant study. An alternative is to take Wı ~ U(0, 1) and Wz ~ U(0, W1). 
Let G; = 1 or 2 denote the significance category of study j. Then an additional stage is 
included in the above model, such that 


Y; ~ Ny, s) 
vj = y;/ We, (10.17) 
Vj ~ N(p, 7°) 


Givens et al. (1997) propose a scheme which models the number of missing studies in 
each significance category, i.e. Mı and Mh in the above two category example. They 
then introduce data augmentation to reflect the missing effects comparable to Y;, 
namely Z1;,7=1,..,M in the high significance category, and Zy, j= 1,.., M2 in 
the lower significance category. The missing study numbers M, and Mp are taken as 
negative binomial 


M; ~ NB(gj, Nj) 


where the prior for g; would reflect the higher publication chances in the high signifi- 
cance category; the Givens et al. simulation in Example 10.16 described below took 
gı ~ U(0.5, 1) and g2 ~ U(0.2, 1). The missing effects are generated in a way consistent 
with their sampled category'!. 


Example 10.16 Simulated publication bias We follow Givens et al. in generating a set 
of studies, only some of which are observed (published) subject to a known bias 
mechanism. Thus, the original known variances s of 50 studies are generated according 


11 Implementing the Givens et al. approach in WINBUGS is limited by M, and Mz being stochastic indices, 
and the fact that for loops cannot be defined with stochastic quantities. 
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to s ~ G(3, 9). The variance of the underlying study effects is t? = 0.03. These variance 
parameters are in fact close to those of a set of observed studies on excess lung cancer 
rates associated with passive smoking. A standard meta-analysis of all 50 studies (taking 
the s? as known) then gives an estimated overall relative risk, RR = exp (p) of 1.01 with 
95% interval from 0.86 to 1.18. The priors used on p and 7? are as in Givens et al. (1997, 
p. 229). 


To reflect the operation of publication bias, a selective suppression is then applied. 
The 26 ‘positive’ studies with 


| ¥j/sj| > 0 


and significance rates p therefore between 0 and 0.5 are retained in their entirety. The 24 
negative studies with | Y;/s;| < 0 are subjected to a 70% non-publication rate. In prac- 
tice, this led here to retaining seven of the 24 negative studies and all the 26 positive 
studies (the seven retained had uniformly generated numbers exceeding 0.7, while non- 
publication applies to those with numbers under 0.7). So Mı =0 and M,=17. A 
standard meta-analysis of this set of 33 studies gives an underlying central relative 
risk exp (wp) of 1.26 with 95% interval from 1.17—1.51. This is considerably in excess of 
the ‘true’ RR in the data set of all 50 studies. 

We then use the comparison of uniforms method of Silliman to compensate for bias, 
and the two mechanisms described above (Model A). Using the first mechanism gives an 
estimated mean RR of 1.18, and 95% interval from 1.005—1.39. This approach gives a 
clearer basis for doubting that the underlying RR over the studies exceeds unity. With 
the second mechanism, one obtains a posterior mean RR of 1.12 with 95% interval from 
1.02-1.23. 

To assess sensitivity to priors on the underlying study effects (Smith et al., 1995) an 
alternative Student ¢ prior is taken, in combination with the second mechanism, such 
that v; ~ ts(w, 1°). This gives a mean RR of 1.11 with 95% interval from 1.006-1.23, so 
that conclusions are unaltered. Similar models might be envisaged, for example regres- 
sion models for the publication probability in Equation (10.17) with a coefficient on the 
study significance ratio constrained to be positive: 


Y; ~ N(y, s>) 
v= y/W; 
logit(W;) = BY;/s; 
B ~ NO, 1) 
Applying this model here (Model B) gives an interval on RR of {0.99, 1.26}. 


10.5 REVIEW 


Bayesian epidemiology has drawn on wider ideas and developments in Bayesian statis- 
tics, but is oriented to specific concerns such as arise in the analysis of disease risk and 
causation. These include control for confounding influences on disease outcome where 
the confounder affects disease risk, and is also unequally distributed across categories of 
the main exposure; allowing for measurement errors in disease risk or outcome, perhaps 
drawing on information from calibration studies (Stephens and Dellaportas, 1992); the 
delineation of disease and risk factor distributions over time, attributes of individuals 
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(e.g. age) and place (Ashby and Hutton, 1996); and the tailoring of hierarchical methods 
for combining information to the meta-analysis of medical intervention or risk factor 
studies. Because epidemiological applications often focus on the impact of well docu- 
mented risk factors, framing of priors often involves elicitation of informative priors; 
this is so especially in clinical epidemiology in the sense of models for randomised trials, 
diagnostic tests, etc, as illustrated by ranges of priors (sceptical, neutral enthusiastic, 
etc.) possible in clinical trial assessment (Spiegelhalter et al., 1999) or the use of informa- 
tive priors in gauging diagnostic accuracy (Joseph et al., 1995). 

The above chapter has been inevitably selective in coverage of these areas. Thus while 
state space random walk models have been illustrated in Example 10.6 (and the 
modelling of place effects in disease outcomes in Chapter 7) more complex examples, 
in terms of identifiability issues, occur in disease models with age, period and cohort 
effects all present; methodological issues in this topic are discussed by Knorr-Held 
(2000). Recent work on measurement error modelling in epidemiology includes 
Richardson et al. (2001) 
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EXERCISES 


1. In Example 10.2, try the square root transforms instead of the log, transforms of age 
and SBP in the coefficient selection procedure (or include both square root and log, 
transforms). Then fit a risk model using the most frequently selected terms and assess 
its fit. 

2. In Example 10.4, replicate and assess the sensitivity of the analysis into illness and 
obesity by using (a) correlated physical and mental health status factors, and (b) 
using robust (heavy tailed) alternative for the density of the factors. 


3. In Example 10.5 try adding a square root term rather than a square in rads to the 
power model with continuous outcome. Does this improve fit? 


4. In Model A in Example 10.6, fit a logistic regression with categories under 130, 
130-149, 150-169, 170-189, 190-209 and 210 and over, and then fit a weighted 
quadratic regression to the odds ratios (as in Model A2 in Program 10.6). How 
does this illustrate the nonlinear impact of SBP? 


5. In the flour beetle mortality Example 10.7, consider the generalisation for dosage 
effects of the model 


Ry = CG, X-) — C(tj-1, Xr) 
where 
ct,)=) AFA rg! 


and where F(X)= exp(Bı Blog ¥)=exp(®,)X*. For identiability a prior 
à ~ U(0, 1) may be used with à = | corresponding to the proportional odds model 
actually used in Example 10.7, and \ = 0 giving a Weibull model, when 

C4, X) =exp(—e" X™) [1p 


6. Following Example 10.11, carry out a meta-analysis of the mortality outcomes 
following CABG for high risk patients, where the odds ratios and their confidence 
limits for seven studies are 

OR = 0.58, 0.37, 0.43, 0.56, 0.27, 1.89, 0.95 
LCL = 0.33, 0.15, 0.15, 0.19, 0.05, 0.31, 0.23 
UCL = 1.01, 0.89, 1.26, 1.63, 1.45, 11.64, 3.83 


7. In Example 10.12, apply a Bayesian significance test to obtain the probabilities that 
the v; are positive. 


8. In Example 10.14, assess sensitivity of inferences to adopting a Student t prior rather 
than Normal prior for the study treatment effects, 5;, with degrees of freedom an 
extra parameter (in both the McIntosh and Thompson et al. models). 
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abortion, attitudes to 149-51 
Academic Self-Concept (ASC) 327, 329 
Accelerated Failure Time (AFT) model 
370-2 
accelerated hazards 370-2 
Activities of Daily Living (ADL) 201, 369, 
377-8 
acute myocardial infarction (AMI) 
and magnesium trials 437-9 
see also myocardial infarction 
Additional Attention Deficit (AD) 365 
adolescent self-concept 327 
confirmatory factor analysis 328 
adoption of innovations 314-15 
AFF 287, 288, 302-4 
Affective Perception Inventory 328 
age effects in spinal surgery 116-18 
age stratification, alcohol consumption and 
oesophageal cancer with 401-4 
Age-Period-Cohort (APC) models 310 
ageing, longitudinal study of 377-8 
agricultural subsistence, and road 
access 282-4 
AIDS risks 338-40 
Akaike Information Criterion (AIC) 32-3, 40, 
63, 97, 115, 379 
alcohol consumption and oesophageal cancer 
with age stratification 401-4 
alienation over time 354-6 
AMIS study 435-6 
annuitant deaths 262-4 
ante-natal knowledge 347-9 
posterior parameter summaries 348 
arbovirus injection 427-8 
ARCH model 205, 210-13 
area level trends 312-13 
areas vs. case events as data 306 
ARIMA models 185 
ARMA coeffecients 178 
ARMA models 171, 173-4, 193, 215 
metric outcome 194 
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Armed Services Vocational Aptitude Battery 
Test (ASVAB) 379 
Arterial Road Accessibility (ARA) 282-4 
aspirin, predictive cross-validation for 
meta-analysis 433-6 
asymmetric series 206-7 
attempted suicide in East London 296-8 
attitudes to abortion 149-51 
autoregression on transformed outcome 
193 
Autoregressive Distributed Lag (ADL or 
ARDL) model 173, 201 
Autoregressive Latent Trait (ALT) 
models 230, 231 
autoregressive models 172-91, 208, 215 
panel data analysis 230 
without stationarity 175-6 


baseball, shooting percentages 46-7 

Basic Structural Model (BSM) 207-8 

Bayes factors 88, 90, 91, 115, 200 

Bayes Information Criterion (BIC) 32-3, 40, 
97, 182, 362 

Bayesian Cross Validation (BCV) method 96, 
97 

Bayesian model estimation via repeated 
sampling 1-30 

Bayesian regression 79-84 

Bernoulli density 13, 17 

Bernoulli likelihood model 385 

Bernoulli priors 86 

Bernoulli sampling 338, 377, 423-5 

beta-binomial mean proportion 424 

beta-binomial mixture for panel data 244-6 

beta-density 15 

binary indicators 403 

for mean and variance shifts 215-16 

binary outcomes 282 

binary selection models for robustness 
119-20 

binomial data 43-4 

binomial density 13, 17 
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binomial sampling 423-5 
biostatistics, Bayesian approaches to 
modelling 397 
bivariate factor model 339 
bivariate forecasts 189-91 
bivariate mortality outcome with correlated 
errors 289 
bivariate Poisson outcomes 160-3 
bladder cancer 391-3 
BMI 411-13 
boric acid exposure 423-5 
breast cancer 
and radiation 417-18 
survival 371-2 


Caesarian section, infection after 103-4 
case-control analysis, larynx cancer 
407-9 
case-control data, oesophageal cancer 402 
causal processes 397-9 
specific methodological issues 398-9 
causal relations 397-447 
causality, establishing 397-9 
central limit theorem 37 
chain trace plots 162 
CHD, Framingham study 404-7, 418-22 
chi-square 15 
Cholesky decomposition 16 
CIS (Clinical Interview Schedule) 336-8 
clinal clustering model 307 
clustered metric data, latent trait models 
for 343-4 
clustering 
effects on dose-response models 416 
in relation to known centres 304-10 
coaching programs 56-8 
coal consumption, UK 207-8 
coal production, US 185-8 
co-integrated series 201 
colon cancer survival, non-parametric smooth 
via equally weighted mixture 375 
colorectal cancer 
Kaplan-Meier method 373-4 
survival and hazard rates at distinct death 
times 374 
Columbus, Ohio, crime data 284-6, 301-2 
compositional effects 136 
conditional autoregression approach 288 
conditional autoregressive (CAR) prior 276, 
277, 282, 287-9 
conditional error model 262 
conditional likelihood 175 
conditional logit model 99, 101 


Conditional Predictive Ordinate (CPO) 87, 90, 
106-7, 115, 183, 219, 238, 284, 286, 289, 
362, 371 
harmonic mean estimator 40 
statistics 182 
conditional priors 293 
vs. joint priors 276-8 
conditional specification of spatial error 293-8 
confirmatory factor analysis models 324 
adolescent self-concept 328 
with a single group 327-33 
confounding between disease risk 
factors 399-413 
conjugate mixture approach 246-8 
consumption function for France 219-21 
contaminated sampling, group means 
from 121 
contamination/misclassification model 125 
contextual effects 136 
continuity parameter models 195 
continuous data 
latent trait models for 341 
smoothing methods for 51-3 
with fixed interaction schemes, spatial 
regressions models 275-8 
continuous outcomes 229 
continuous time functions for survival 
363-70 
convergence 
assessing 18-20 
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of Gelman-Rubin criteria 72 
convolution model 279, 281 
Coronary Artery Bypass Graft (CABG), 
survival after 431-2 
correlated error model 285 
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count regression 83 
counting process models 388-93 
counts, integer valued autoregressive (INAR) 
models for 193-5 
covariance modelling in regression 290-1 
covariances between factors 329 
Cox proportional hazards model 389 
crime rate data 284-6 
cross-fertilisation in plants, Darwin data 
on 53-4 
cross-level effect 144 
cross-sectional data 259 
cross-validation methodology 96, 126 
cross-validation regression model 
assessment 86-8 
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cumulative mortality in relation to 
dose-time 422-3 
cumulative odds model 101 


Darwin data on cross-fertilisation in 
plants 53-4 
demographic data by age and time 
period 261-2 
dental development data 259-61 
dependent errors 173-4 
depression state, changes in 345-7 
deviance criterion 94 
Deviance Information Criterion (DIC) 33, 45, 
71-2, 182, 236, 238, 239, 250, 254, 256, 
303, 304, 392, 436, 439 
diagnostic tests, ordinal ratings in 108-10 
differential equation prior 117 
diffusion processes 314 
direct modelling of spatial covariation 289-98 
Dirichlet process mixture 239 
Dirichlet process priors (DPP) 60-1, 75, 82, 
155 
Dirichlet processes 58—67 
discontinuities in disease maps 281 
discordant observations, diagnostics for 120-1 
discrete mixture model 154 
discrete mixtures 58—67 
discrete outcomes 334 
multi-level models 139-40 
spatial covariance modelling for 296-8 
spatial effects for 278-89 
time-series models 191—200 
discrete parametric mixtures 58—60 
discrete time approximations 372-82 
discrete time hazards regression 375-6 
discrete time regression, proportional hazards 
in 384-7 
disease models, spatial priors in 279-81 
disease risk factors, confounding 
between 399-413 
disease risks, entangling or mixing of 398 
distance decay functions 291, 305 
distance decay parameter 310 
distribution function 243 
dose-response models 398, 413-28 
clustering effects on 416-17 
compliance and response 425-6 
dose-response relations, background 
mortality 427 
dose-time, cumulative mortality in relation 
to 422-3 
Dukes’ colorectal cancer 374 
Durbin—Watson statistic 183 
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Dutch language tests 142 
dynamic linear models 203-9 


ECM models 201-3 

ecological analysis involving count 
data 278-89 

economic participation among working age 
women 165-7 

EDA method 291 

England vs. Scotland football series 197-8 

ensemble estimates 41-58 

environmental pollution, multiple sources 
of 306-7 

epidemiological analysis of small area disease 
data 278 

epidemiological methods and models 
397-447 

epidemiological modelling, Bayesian 
approaches 397 

EQS 326 

error correction models 200-3 

ethylene glycol monomethyl ether 
(EGMME) 350-2 

event history models, accounting for frailty 
in 384-8 

exchange rate series volatility 213-15 

exponential mixtures for patient length of stay 
distributions 64-7 

exponential samples 14-15 

factor structures at two levels 349-50 

fathers’ occupation and education 353-4 

fertility cycles to conception 48-51 

fixed effects models 249 

flour beetle mortality 423 

forecasting, panel data analysis 257-64 

Framingham study 404-7, 418-22 

France, consumption function for 219-21 


gamma density 15, 196, 243, 375 
gamma Markov process 196 
gamma process priors 381-2 
gamma-Poisson heterogeneity 255 
gamma-Poisson mixture 95 
gamma-Poisson model 254 
gamma variables 253 
GARCH model 205, 210-11, 213 
Gaussian correlation function 290 
Gaussian distance decay function 302 
Geisser-Eddy cross-validation method 90 
Gelman-—Rubin criteria 23 

convergence of 72 
Gelman-—Rubin statistic 19, 162 
gender, language score variability by 147-9 
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General Additive Models (GAM) for nonlinear 
regression effects 115-18 

General Linear Models (GLMs) 82 

General Motors 188 

General Self-Concept (GSC) 327 

general variance function 94 

Geographically Weighted Regression 
(GWR) 299-302, 314 

Gessel score 26-7 

GHQ (General Health Questionnaire) 336-8 

Gibbs sampling 5-12 

Glasgow deaths 288-9 

GLS 137 

GNP series 184, 185, 186 

goodness of fit criteria 96 

group factor model, extension of 328 

group means from contaminated sampling 121 

group survival curves 368 

growth curve 

variability 232-3 
with random variation in trajectory 
parameters over subjects 234 

growth curve analysis, plasma citrate 
readings 235-6 

growth curve model, variable lag effects 242 


HADS (Hospital Anxiety and Depression 
Scale) 336-8 
Hald data 91-2 
harmonic mean estimates 89 
hazard function 389 
hazard profile 392-3 
hemorrhagic conjunctivitis 70-2 
hepatitis B in Berlin regions 44-5 
heterogeneity 
between subjects 228-9 
in patient risk 436-7 
heterogeneity modelling 43-4, 245, 254 
regression parameters and estimated firm 
effects 255 
heteroscedasticity 302-4 
in level 1 language score variances 148 
modelling 145-51 
hierarchical mixture models 31-78 
histogram smoothing 69-70 
homoscedastic errors 282 
homoscedasticity 285 
hospital profiling application 158-60 
hospitalisations 
following self-harm 296-8 
mental illness 72-4 
of schizophrenic patients 97 
hypertension trial 236-42 
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alternative models 238 
multi-level model, parameter summaries 241 


ICAR models 280-2, 287, 288, 293, 298, 301, 
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imputation approach 253 

indicator-factor loadings 329 

infection after Caesarian section 103-4 

influence checks 22-3 

informative priors 81 

innovations, adoption of 314-15 

integer valued autoregressive (INAR) models 
for counts 193-8 

inter-urban moves 247 

Intrinsic Bayes factor 40 

inversion method 13-14 

invertibility/non-invertibility 177-9 

investment levels by firms 188-9 


joint priors vs. conditional priors 276-8 
Joint Schools Project (JSP) 153-6 


Kaplan—Meier method 372, 373 
colorectal cancer 373-4 

Kernel plot of directional parameter 311 

Kullback-Leibler divergence 41 

kyphosis and age data 116-18, 127-8 


labour market rateable values 55-6 
lamb fetal movement counts, times series 
of 217-18 
language scores 
in Dutch schools 140-3 
variability by gender 147-9 
Language Skills Self-Concept (LSC) 327, 329 
Laplace approximation 84 
to marginal likelihood 97 
Laplace priors 281 
larynx cancer 
and matched case-control analysis 407-9 
case data in Lancashire 308-10 
latent class models 335-8 
through time 341-2 
latent dependent variable method 117 
latent mixtures, regression models with 110-15 
latent structure analysis for missing 
data 352-6 
latent trait analysis 334-5 
latent trait models 
for clustered metric data 343-4 
for continuous data 341 
for mixed outcomes 344-5 
for time varying discrete outcomes 343 


INDEX 


latent variable effects, nonlinear and 
interactive 332-3 
latent variables in panel and clustered data 
analysis 340-52 
LCA model 338 
least squares estimate 152 
Left Skewed Extreme Value (LSEV) 
distribution 102 
leukemia 
cases near hazardous waste sites 304, 307-8 
cases near nuclear processing plants 304 
remission data 11-12, 367, 382-3, 390-1 
survival data 123-4 
likelihood model for spatial covariation 294-6 
likelihood vs. posterior density 3 
linear mixed model 234-5 
link functions 102-3 
lip cancers data 287-8, 302-4 
LISREL 324, 326, 335, 341 
log-logistic model 367 
log-normal heterogeneity 255 
logit model 124 
logit under transposition 124 
long term illness 143-5 
longitudinal discrete data 243-57 
longitudinal studies 
ageing 377-8 
missing data in 264-8 
lung cancer 
cases and controls by exposure 415 
death trends 160-3 
in London small areas 23-6 


M-estimators 119 
magnesium trials and acute myocardial 
infarction (AMI) 437-9 

Mantel-Haenszel (MH) estimator 400 
MAR assumption 352, 356 
marginal likelihood 

approximations 90 

identity 88 

in practice 35-7 

Laplace approximation to 97 

model selection using 33-4 
Markov chain model 230, 342 
Markov Chain Monte Carlo see MCMC 
Markov mixture models 216-18 
Markov models 342 

mixed 346 
Mathematics Self-Concept (MSC) 327-9 
maths over time project 153-6 
maximum likelihood estimates of SMRs 278-9 
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